Abstract

Background: In genetic association studies of quantitative traits using F∞ models, how to code the marker
genotypes and interpret the model parameters appropriately is important for constructing hypothesis tests and
making statistical inferences. Currently, the coding of marker genotypes in building F∞ models has mainly focused
on the biallelic case. A thorough treatment of the coding of marker genotypes and the interpretation of model
parameters for F∞ models is needed, especially for genetic markers with multiple alleles.

Results: In this study, we formulate F∞ genetic models under various regression model frameworks and
introduce three genotype coding schemes for genetic markers with multiple alleles. Starting from an allele-based
modeling strategy, we first describe a regression framework to model the expected genotypic values at given
markers. Then, as an extension of the biallelic case, we introduce three coding schemes for constructing fully
parameterized one-locus F∞ models and discuss the relationships between the model parameters and the
expected genotypic values. Next, under a simplified modeling framework for the expected genotypic values, we
consider several reduced one-locus F∞ models from the three coding schemes, focusing on the estimability and
interpretation of their model parameters. Finally, we explore some extensions of the one-locus F∞ models to two
loci. Several fully parameterized as well as reduced two-locus F∞ models are addressed.

Conclusions: The genotype coding schemes provide different ways to construct F∞ models for association testing
of multi-allele genetic markers with quantitative traits. Which coding scheme should be applied depends on how
conveniently it supports statistical inference on the parameters of research interest. Based on these
models, standard regression model fitting tools can be used to estimate and test for various genetic effects
through statistical contrasts, with adjustment for environmental factors.



Background 

Genetic markers with multiple alleles are common in genetic studies. It is well known that the
ABO blood types in humans are determined by three alleles at a genetic locus on chromosome 9.
Molecular markers such as microsatellites often have multiple alleles. The major
histocompatibility complex (MHC), a highly polymorphic genome region that resides on human
chromosome 6, encompasses multiple genes that encode many human leukocyte antigens (HLA) and
play an important role in the regulation of immune responses. Depending on the resolution level
of allele typing, each of the HLA-A, B, C, DR, DQ and DP gene loci could contain tens to
hundreds of allele types. In addition, in the haplotype analysis of single-nucleotide
polymorphisms (SNPs), the various haplotypes from a set of SNPs can also be treated as different
alleles of a 'super' marker locus that consists of the set of SNPs.

Presently, three types of genetic models are commonly used in the genetic analysis of
quantitative traits. One is Fisher's analysis of variance (ANOVA) models, which focus on a
decomposition of the genotypic variance into genetic variance components contributed by various
genetic effects at quantitative trait loci (QTL) [1-6]. Another is the F∞ models, which
concentrate on direct statistical modeling of the expected genotypic values at target genetic
markers or QTL and on the association testing of various genetic effects. The third is the
so-called functional genetic models, which emphasize modeling the functional effects of genes
[7]. Both Fisher's and F∞ models can be referred to as statistical models, while the functional
genetic models have fundamentally different objectives and estimation methods from the
statistical models. The distinction between these different types of genetic models has been
discussed extensively [8-11].

The F∞ models have been widely used in genetic association studies of quantitative traits. In
building F∞ models, how to code genotypes at a marker (or QTL) and interpret the model
parameters are fundamental issues for constructing appropriate hypotheses and making correct
statistical inferences. While Fisher's ANOVA models are directly applicable to genetic markers
with multiple alleles, the F∞ models by contrast have mainly been discussed in the biallelic
case [1,9,12]. For haplotype analysis, Zaykin et al. [13] proposed a simple coding that
included only the additive effects of haplotypes but ignored their interactions. More recently,
Yang et al. [11] explored an extension of the biallelic F∞ models to multi-allele models, with a
focus on the definition of various genetic effects and their relationships with the average
genetic effects defined in Fisher's models. A thorough treatment of the coding of marker
genotypes and the interpretation of model parameters for F∞ models has not yet been given,
especially for genetic markers with multiple alleles.

In general, there are two different strategies for coding the marker or QTL genotypes. One is to
treat each marker or QTL as a potential risk factor with its genotypes as the risk units. Then,
similar to the handling of categorical covariates in classical regression models, at each locus
we can create one dummy variable per genotype and include all but one (the reference) of these
dummy variables in a model. But this genotype coding is often limited by the available sample
size, especially when the number of alleles at the marker locus is large. Alternatively, as
alleles are often supposed to be the basic genetic risk units that may contribute to disease
phenotypes in genetic studies, we may want to treat the alleles at each marker or QTL as the
risk units and examine the effects of alleles. However, genetic data have some special features
that need to be taken into account in order to build allele-based models. In the genome of a
diploid species such as humans, alleles normally appear in pairs to form a genotype at each
marker locus or QTL, with one allele from the father and one from the mother, except for the sex
chromosomes in males. That is, at each locus we have two within-locus risk factors that reside
on a homologous pair of chromosomes. Unlike the classical two-way ANOVA model, in which the two
risk factors have different risk units, the paternal and maternal risk factors at a locus often
share the same set of alleles. Besides, the parental origins (i.e., the phase) of the two
alleles at each locus are quite often unknown. These features can complicate the allele-based
coding of marker genotypes and cause confusion in the interpretation of the model parameters.

In this study, we introduce three allele-based coding schemes for building F∞ models, namely the
allele, F∞ and allele-count codings. First, we formulate F∞ models under a general regression
framework to model the expected genotypic values at given markers or QTL. Then, under a standard
ANOVA model setting, we present several fully parameterized one-locus models using the three
allele-based coding schemes. Some potential collinearity relationships among the coding
variables of the marker genotypes are clarified, and strategies to avoid redundant model
parameters are proposed. After that, we examine the definition of model parameters under a
reduced one-locus model framework. The impact of a linear relationship among the coding
variables of marker genotypes on the estimability of the model parameters is explored based on
linear model theory. Finally, we consider extensions of the one-locus models to the two-locus
situation. Several fully parameterized as well as reduced two-locus models are addressed. A
focus of this study is to establish the relationships between the model parameters and the
expected genotypic values at given marker loci or QTL for the various F∞ models from these
three coding schemes under different model frameworks, and to explain how to estimate and test
for various genetic effects through statistical contrasts. Relationships among different coding
schemes and models are also illustrated through simulation.

Results 

Fully parameterized one-locus models 

In genetic studies, a quantitative trait $Y$ is typically considered as a combination of a
genetic component $G$ and an environmental component $E$, possibly with genetic by environmental
interactions $G \times E$, where $G$ is the true genotypic value from the joint (unobservable)
contribution of all the genetic factors to the quantitative trait $Y$. In practice, given a
random sample of $N$ individuals from a study population, let $g_i$ be the observed genotypes at
certain target marker loci or QTL and $z_i$ be a vector of environmental covariates that may
contribute to the variation of the quantitative trait for individuals $i = 1, \ldots, N$. By
ignoring the genetic by environmental interactions and assuming that the genotypic value $G$ and
the environmental component $E$ do not depend on the environmental covariates $z_i$ and the
genotypes $g_i$, respectively, the observed quantitative trait $y_i$ of an individual $i$ can be
expressed through a regression model as

$$y_i = G(g_i) + z_i \beta + e_i, \quad i = 1, \ldots, N \qquad (1)$$

where $G(g_i) = E(G \mid g_i)$ is the expected genotypic value of $G$ given the marker (or QTL)
genotypes $g_i$, $\beta$ denotes the effects of the environmental covariates, and $e_i$ is the
residual error of the model with $E(e_i) = 0$. Similar to introducing dummy variables for the
covariates $z_i$, which allows us to assess various environmental effects $\beta$ in the model,
it is convenient to further represent $G(g_i)$ as $G(g_i) = x(g_i)\alpha$, so that we can fit
the regression model and assess the genetic effects $\alpha$ of the markers or QTL, where
$x(g_i)$ is a coding function of the marker genotypes. When the marker locus is not associated
with the phenotype, $G(g_i) = E(G)$ is a constant that does not depend on $g_i$. In the rest of
the paper, we focus on the interpretation of the marker effects $\alpha$ in terms of the
expected genotypic values $G(g) = E(G \mid g)$ under different coding schemes. When certain
genetic by environmental interactions are included in the model, the interpretation of $\alpha$
has to be modified accordingly. It should be pointed out that QTL are generally assumed to be
unknown genomic regions that may contribute to the variation of the quantitative traits, with
their genotypes unobserved. The results (i.e., the coding schemes and the relationships between
the model parameters and the expected genotypic values) hold for QTL as well, although the
expected genotypic values at a target QTL can no longer be directly estimated by fitting the
regression models.

Now, consider one target marker locus with multiple alleles $A_1, \ldots, A_m$, $m \ge 2$. In
general, there are $m$ possible homozygous genotypes $A_jA_j$, $j = 1, \ldots, m$, and
$m(m-1)/2$ possible heterozygous genotypes $A_jA_k$, $j \neq k$. Let
$G_{jk} = E(G \mid g = A_jA_k)$ be the expected genotypic values, given the marker genotypes
$A_jA_k$ in a study population. Without knowing the parental origins of the alleles, we assume
as usual that the parental origin of the alleles does not make a difference (i.e., no
imprinting). We then have $G_{jk} = G_{kj}$ for $j, k = 1, \ldots, m$, and there are in total
$m(m+1)/2$ possibly distinct expected genotypic values $G_{jk}$, $j, k = 1, \ldots, m$, which
could be estimated through the means in the genotypic subgroups after adjustment for the
environmental covariates. Here we assume no missing genotypes for the sampled individuals, and
that all possible genotypes are represented in the random sample. How to handle missing
genotypes will be discussed in the Discussion. To fully re-parameterize these expected genotypic
values through a linear model, we need in total $m(m+1)/2$ parameters, including the intercept.
By treating the paternal and maternal alleles as two independent risk factors and following the
classical two-way ANOVA notation, we can represent the genotypic values $G_{jk}$ as

$$G_{jk} = \mu^* + \alpha_j^* + \alpha_k^* + \delta_{jk}^*, \quad j, k = 1, \ldots, m \qquad (2)$$

where $\alpha_j^*$ and $\delta_{jk}^*$ are the realized (but unobservable) additive effect of
allele $A_j$ and the allelic interaction between the two alleles $A_j$ and $A_k$, respectively.
The above model differs from the classical two-way ANOVA model in that here both the paternal
and the maternal risk factors share the same set of alleles $A_1, \ldots, A_m$. As usual, with
the unknown parental origins of alleles at the locus, we assume the paternal and maternal
alleles have the same genetic effect. More precisely, the paternal allele $A_j$ and the maternal
allele $A_j$ have the same additive allelic effect $\alpha_j^*$ for $j = 1, \ldots, m$. Besides,
the allelic interaction between a paternal allele $A_j$ and a maternal allele $A_k$ is the same
as that between the paternal allele $A_k$ and the maternal allele $A_j$, i.e.,
$\delta_{jk}^* = \delta_{kj}^*$ for $j, k = 1, \ldots, m$. Still, with $m$ additive allelic
effects and $m(m+1)/2$ allelic interactions plus the intercept, it is clear that model (2) is
over-parameterized for modeling the $m(m+1)/2$ expected genotypic values $G_{jk}$,
$j, k = 1, \ldots, m$. As a result, the parameters $\mu^*$, $\alpha_j^*$ and $\delta_{jk}^*$ in
model (2) are not all estimable in terms of the expected genotypic values $G_{jk}$ (see [14,15]).
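The over-parameterization of model (2) can be seen by simple counting. The following is a minimal illustrative sketch, not part of the original analysis; the function names are hypothetical.

```python
from itertools import combinations_with_replacement


def count_genotypic_values(m: int) -> int:
    """Number of distinct unordered genotypes A_j A_k (j <= k) for m alleles."""
    return sum(1 for _ in combinations_with_replacement(range(1, m + 1), 2))


def count_model2_parameters(m: int) -> int:
    """Intercept + m additive effects + m(m+1)/2 symmetric allelic interactions."""
    return 1 + m + m * (m + 1) // 2


for m in (2, 3, 4):
    n_values = count_genotypic_values(m)
    n_params = count_model2_parameters(m)
    # The excess n_params - n_values is the number of redundant parameters.
    print(m, n_values, n_params, n_params - n_values)
```

Under this count, model (2) always carries m + 1 more parameters than there are expected genotypic values, which is why some of them have to be constrained or dropped.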

To avoid the inestimability issue, one way is to add constraints on the model parameters.
However, those constraints, together with the symmetry property of $\delta_{jk}^*$, could make
it difficult to fit the model using standard software packages such as SAS. Alternatively, we
consider dropping certain redundant parameters from the model. Similar to the biallelic case
[10], let us first introduce the following indicator variables to describe the transmission of
alleles from parents to their offspring

$$z_{1j} = \begin{cases} 1, & \text{inherited } A_j \text{ on paternal gamete,} \\ 0, & \text{inherited other alleles on paternal gamete} \end{cases}$$

and

$$z_{2j} = \begin{cases} 1, & \text{inherited } A_j \text{ on maternal gamete,} \\ 0, & \text{inherited other alleles on maternal gamete} \end{cases}$$

for each allele type $A_j$, $j = 1, \ldots, m$. Then we define the following coding variables of
the marker genotypes

$$w_j(g) = z_{1j} + z_{2j} = \begin{cases} 2, & \text{if } g = A_jA_j \\ 1, & \text{if } g = A_jA'_j \\ 0, & \text{if } g = A'_jA'_j \end{cases} \qquad v_{jk}(g) = \begin{cases} 1, & \text{if } g = A_jA_k \\ 0, & \text{otherwise} \end{cases}$$

for $j, k = 1, \ldots, m$, where $A'_j$ denotes any allele type other than $A_j$. Note that
$z_{1j}$, $z_{2j}$ are not observable because, without parental information, we do not know
exactly which allele is inherited from the paternal or the maternal gamete for the sampled
individuals. But this unknown phase problem does not affect the definitions of $w_j$, $v_{jk}$,
since $w_j$ only counts the number of copies of allele $A_j$ in the genotype, and the value of
$v_{jk}$ is 1 when the genotype is $A_jA_k$ and 0 otherwise, regardless of where the two alleles
come from. We refer to the above coding of marker genotypes as the allele coding scheme. Model
(2) can then be re-written in a linear model form as

$$G(g_i) = \mu^* + \sum_{j=1}^{m} \alpha_j^* w_j(g_i) + \sum_{j=1}^{m} \sum_{k=j}^{m} \delta_{jk}^* v_{jk}(g_i) \qquad (3)$$

for $i = 1, \ldots, N$. As each individual always carries two alleles at a marker locus, one
from the father and the other from the mother, we have
$\sum_{j=1}^{m} z_{1j}(g_i) = \sum_{k=1}^{m} z_{2k}(g_i) = 1$ for any $i = 1, \ldots, N$.
Therefore, given a particular $j$, $w_j = 2 - \sum_{k \neq j} w_k$, which is a linear
combination of the rest of $\{w_k, k \neq j\}$. For $v_{jk}$, we
also have Y1J=\ ^jk = ^ih or Vjk = Wk/2 — YlyjVik. Hence, 
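each of the $v_{jk}$, $k = 1, \ldots, m$, is also a linear combination of the coding variables
$\{w_k, k \neq j\}$ and $\{v_{lk}, l, k \neq j\}$. To avoid the redundancy of parameters due to
these collinearity relationships among the coding variables in model (3), without loss of
generality, we consider dropping $w_m$ and $\{v_{km}, k = 1, \ldots, m\}$ in (3). Then

$$G(g_i) = \mu + \sum_{j=1}^{m-1} \alpha_j w_j(g_i) + \sum_{j=1}^{m-1} \sum_{k=j}^{m-1} \delta_{jk} v_{jk}(g_i) \qquad (4)$$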



for $i = 1, \ldots, N$. Model (4) now provides a full re-parameterization of the $m(m+1)/2$
expected genotypic values $G_{jk}$, $j, k = 1, \ldots, m$. Its parameters $\alpha_j$ can be
referred to as the additive allelic effects and $\delta_{jk}$ as the allelic interactions with
respect to the reference allele $A_m$. Given a random sample, we can then incorporate model (4)
into (1) and fit the regression model (1) using the standard least-squares approach. In terms of
the expected genotypic values, it is easy to show that $\mu = G_{mm}$,
$\alpha_j = G_{jm} - G_{mm}$ and $\delta_{jk} = (G_{jk} - G_{km}) - (G_{jm} - G_{mm})$, for
$j = 1, \ldots, m-1$ and $k = j, \ldots, m-1$. Therefore, the additive allelic effect
$\alpha_j$ can be interpreted as the substitution effect of replacing allele $A_m$ by $A_j$ when
paired with another allele $A_m$ to form the genotypes. Meanwhile, the allelic interaction
$\delta_{jk}$ is the difference between the substitution effect of replacing allele $A_m$ by
$A_j$ (or $A_k$) when paired with allele $A_k$ (or $A_j$) and that when paired with allele
$A_m$. Note that dropping $w_j$ and $\{v_{kj}, k = 1, \ldots, m\}$ for a particular
$j \neq m$, instead of $w_m$ and $\{v_{km}, k = 1, \ldots, m\}$, leads to similar
interpretations of the model parameters with $A_j$ being the reference allele. Using model (4),
we can also estimate and test for various other genetic effects. For example, the so-called
functional 'additive effects' $a^*_{jk} = (G_{jj} - G_{kk})/2$ and 'dominance effects'
$d^*_{jk} = G_{jk} - (G_{jj} + G_{kk})/2$, $j \neq k$, defined in [11] can be expressed as
$a^*_{jk} = (\alpha_j - \alpha_k) + (\delta_{jj} - \delta_{kk})/2$ and
$d^*_{jk} = \delta_{jk} - (\delta_{jj} + \delta_{kk})/2$, $j \neq k$, respectively, in terms of
the above model parameters. So we can estimate $a^*_{jk}$, $d^*_{jk}$ using the fitted model
parameters, or test the hypothesis $H_0: a^*_{jk} = 0$ or $H_0: d^*_{jk} = 0$ through general
linear contrasts [15] using standard software such as SAS. To test whether a particular allele
$A_j$ has an overall effect, the null hypothesis is $H_0: \alpha_j = \delta_{jk} = 0$ for
$k = 1, \ldots, m-1$, which can be tested through either a general linear contrast or a
likelihood ratio test, with $m$ degrees of freedom for the test statistic. The association test
for overall effects of the locus corresponds to the null hypothesis
$H_0: \alpha_j = \delta_{jk} = 0$ for any $j, k = 1, \ldots, m-1$, with $m(m+1)/2 - 1$ degrees
of freedom for the test statistic. Currently, the so-called F∞ model has been widely used in
genetic association studies. In the simple biallelic case with two alleles $A$ and $a$, an F∞
model gives [16-19]

$$G_{AA} = \tau + a, \quad G_{Aa} = \tau + d, \quad G_{aa} = \tau - a$$

where $G_{AA} = E(G \mid AA)$, $G_{Aa} = E(G \mid Aa)$ and $G_{aa} = E(G \mid aa)$ are the three
possible expected genotypic values at the marker. The parameters $a$, $d$ are often referred to
as the additive and dominance effects of the allele $A$ over $a$, and in terms of the expected
genotypic values we have $a = (G_{AA} - G_{aa})/2$ and $d = G_{Aa} - (G_{AA} + G_{aa})/2$. This
F∞ model can also be written in a linear model form as [10]

$$G(g_i) = \tau + a f(g_i) + d h(g_i), \quad i = 1, \ldots, N$$

where $f$, $h$ are two coding variables of the marker genotypes that are defined as

$$f(g) = \begin{cases} 1, & \text{if } g = AA \\ 0, & \text{if } g = Aa \\ -1, & \text{if } g = aa \end{cases} \qquad h(g) = \begin{cases} 1, & \text{if } g = Aa \\ 0, & \text{otherwise} \end{cases}$$

We refer to the above coding of the marker genotypes as the F∞ coding. As a straightforward
extension of the F∞ coding scheme to multiple alleles, we can define the following coding
variables

$$f_j(g) = \begin{cases} 1, & \text{if } g = A_jA_j \\ 0, & \text{if } g = A_jA'_j \\ -1, & \text{if } g = A'_jA'_j \end{cases} \qquad h_j(g) = \begin{cases} 1, & \text{if } g = A_jA'_j \\ 0, & \text{otherwise} \end{cases}$$

for each $j = 1, \ldots, m$. It is easy to see that $f_j$, $h_j$ and the previous $w_j$,
$v_{jk}$, $j, k = 1, \ldots, m$, have the relationships $f_j(g) = w_j(g) - 1$,
$h_j(g) = w_j(g) - 2v_{jj}(g)$, and $v_{jk}(g) = h_j(g)h_k(g)$ for $j \neq k$. Thus, for the
same reason of avoiding collinearity, we can exclude some redundant coding variables and write
a fully parameterized one-locus model using the F∞ coding as

$$G(g_i) = \tau + \sum_{j=1}^{m-1} a_j f_j(g_i) + \sum_{j=1}^{m-1} d_{jj} h_j(g_i) + \sum_{j=1}^{m-1} \sum_{k=j+1}^{m-1} d_{jk} h_j(g_i) h_k(g_i) \qquad (5)$$

for $i = 1, \ldots, N$. By requiring model (5) to be equivalent to (4), we can first build the
relationships between the parameters of the two models and then establish the relationships
between the parameters of model (5) and the expected genotypic values as follows

$$\tau = \mu + \sum_{j=1}^{m-1} \left( \alpha_j + \frac{\delta_{jj}}{2} \right) = G_{mm} + \frac{1}{2} \sum_{j=1}^{m-1} (G_{jj} - G_{mm})$$

$$a_j = \alpha_j + \frac{\delta_{jj}}{2} = \frac{G_{jj} - G_{mm}}{2}, \quad j = 1, \ldots, m-1$$

$$d_{jj} = -\frac{\delta_{jj}}{2} = G_{jm} - \frac{G_{jj} + G_{mm}}{2}, \quad j = 1, \ldots, m-1$$

$$d_{jk} = \delta_{jk} = (G_{jk} - G_{jm}) - (G_{km} - G_{mm}), \quad j, k = 1, \ldots, m-1; \ j < k$$



Therefore, $a_j$ can be interpreted as half of the difference between the two expected
homozygous genotypic values $G_{jj}$ and $G_{mm}$, which is the same as the additive effect
$a^*_{jm}$ defined in [11]. Besides, $d_{jj}$ is the difference between the expected
heterozygous genotypic value $G_{jm}$ and the averaged expected homozygous genotypic value
$(G_{jj} + G_{mm})/2$, which is the same as the dominance effect $d^*_{jm}$ defined in [11]. It
is interesting to see that $d_{jk}$, $j \neq k$, has the same interpretation as $\delta_{jk}$
in model (4), namely the difference between the substitution effects of replacing allele $A_m$
by $A_j$ when paired with alleles $A_k$ and $A_m$. Note that $d_{jj}$ can also be interpreted
as an allelic interaction: the difference between the substitution effects of replacing allele
$A_j$ by $A_m$ when paired with another $A_j$ and with $A_m$. In addition, based on model (5),
the additive effects $a^*_{jk}$ and the dominance effects $d^*_{jk}$ proposed in [11] are
related to the model parameters by $a^*_{jk} = a_j - a_k$ and
$d^*_{jk} = d_{jk} + (d_{jj} + d_{kk})$, $j \neq k$. The overall effect of a particular allele
$A_j$ can be tested through the composite hypothesis $H_0: a_j = d_{jk} = 0$ for
$k = 1, \ldots, m-1$, and the overall effects of the locus can be tested via the null
hypothesis $H_0: a_j = d_{jk} = 0$ for any $j, k = 1, \ldots, m-1$.
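The relationships among the allele and F∞ coding variables can be checked by enumerating all unphased genotypes. The sketch below is illustrative only: genotypes are represented as pairs of allele indices, and the helper names `w`, `v`, `f`, `h` and `check_relationships` are assumptions chosen to mirror the notation in the text.

```python
from itertools import combinations_with_replacement


def w(j, g):
    """Allele coding: number of copies of allele j in genotype g."""
    return g.count(j)


def v(j, k, g):
    """Allele coding: 1 if the unordered genotype g is A_j A_k, else 0."""
    return int(sorted(g) == sorted((j, k)))


def f(j, g):
    """F-infinity coding: 1, 0, -1 for two, one, zero copies of allele j."""
    return {2: 1, 1: 0, 0: -1}[g.count(j)]


def h(j, g):
    """F-infinity coding: 1 if g carries exactly one copy of allele j."""
    return int(g.count(j) == 1)


def check_relationships(m):
    """Verify the stated identities on every genotype of an m-allele locus."""
    alleles = range(1, m + 1)
    for g in combinations_with_replacement(alleles, 2):
        assert sum(w(j, g) for j in alleles) == 2      # two alleles per genotype
        for j in alleles:
            assert f(j, g) == w(j, g) - 1
            assert h(j, g) == w(j, g) - 2 * v(j, j, g)
            for k in alleles:
                if j != k:
                    assert v(j, k, g) == h(j, g) * h(k, g)
    return True


print(check_relationships(4))
```

Because the codings depend only on allele counts, the order of the two alleles in `g` does not matter, which reflects the point that the unknown phase does not affect the coding variables.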

In addition to the allele and F∞ codings, another way of coding the marker genotypes that
occasionally appears in practice is to count the number of copies of each specific allele $A_j$
in the marker genotypes. As each individual can have 0, 1 or 2 copies of an allele $A_j$, by
taking the genotypic group with 0 copies of allele $A_j$ as the baseline, we can introduce the
following two indicator (or dummy) variables for the genotypic groups with 1 and 2 copies of
the allele $A_j$, respectively:

$$h_{1j}(g) = \begin{cases} 1, & \text{if } g = A_jA'_j \\ 0, & \text{otherwise} \end{cases} \qquad h_{2j}(g) = \begin{cases} 1, & \text{if } g = A_jA_j \\ 0, & \text{otherwise} \end{cases}$$

for each $j = 1, \ldots, m-1$. These coding variables of marker genotypes are related to the
previous ones by $h_{1j}(g) = h_j(g) = w_j(g) - 2v_{jj}(g)$ and $h_{2j}(g) = v_{jj}(g)$. We
refer to this coding of marker genotypes as the allele-count coding. Similar to models (4) and
(5), by excluding some redundant coding variables, the allele-count coding leads to another
fully parameterized one-locus model as

$$G(g_i) = \pi_0 + \sum_{j=1}^{m-1} \pi_j h_{1j}(g_i) + \sum_{j=1}^{m-1} \eta_{jj} h_{2j}(g_i) + \sum_{j=1}^{m-1} \sum_{k=j+1}^{m-1} \eta_{jk} h_{1j}(g_i) h_{1k}(g_i) \qquad (6)$$
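for $i = 1, \ldots, N$. Similarly, by requiring model (6) to be equivalent to (4), we can
establish the following relationships

$$\pi_0 = \mu = G_{mm}$$

$$\pi_j = \alpha_j = G_{jm} - G_{mm}, \quad j = 1, \ldots, m-1$$

$$\eta_{jj} = 2\alpha_j + \delta_{jj} = G_{jj} - G_{mm}, \quad j = 1, \ldots, m-1$$

$$\eta_{jk} = \delta_{jk} = (G_{jk} - G_{jm}) - (G_{km} - G_{mm}), \quad j \neq k$$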



Therefore, $\pi_j$ in model (6) can still be interpreted as the substitution effect of
replacing allele $A_m$ by $A_j$ when paired with allele $A_m$, or as the difference between the
genotypic values of the genotype group $A_jA_m$ with one copy of $A_j$ and the genotype group
$A_mA_m$ (baseline). $\eta_{jj}$ is the difference between the expected genotypic value
$G_{jj}$ in the homozygous genotypic group $A_jA_j$ with two copies of $A_j$ and $G_{mm}$ in
the baseline group $A_mA_m$. Besides, $\eta_{jk}$ in model (6) has the same interpretation as
$\delta_{jk}$ (or $d_{jk}$) before. From model (6), the general additive effects are
$a^*_{jk} = (\eta_{jj} - \eta_{kk})/2$ and the dominance effects are
$d^*_{jk} = \eta_{jk} + \pi_j + \pi_k - (\eta_{jj} + \eta_{kk})/2$, $j \neq k$, which can be
tested either separately or jointly. The overall effect of a particular allele $A_j$ can be
tested through the composite hypothesis $H_0: \pi_j = \eta_{jk} = 0$ for
$k = 1, \ldots, m-1$. The overall effects of the locus can also be tested via the null
hypothesis $H_0: \pi_j = \eta_{jk} = 0$ for any $j, k = 1, \ldots, m-1$.

Each of the three models (4), (5) and (6) provides a full re-parameterization of the
$m(m+1)/2$ expected genotypic values under the same model framework (3). The relationships
between their model parameters and the expected genotypic values are summarized in Table 1. It
is interesting to see from Table 1 that the null hypothesis $\alpha_j = \delta_{jj} = 0$ is
equivalent to either $a_j = d_{jj} = 0$ or $\pi_j = \eta_{jj} = 0$, each of which implies
$G_{jj} = G_{jm} = G_{mm}$. So the three models above should provide the same test statistics
for testing $\alpha_j = \delta_{jj} = 0$, $a_j = d_{jj} = 0$ or $\pi_j = \eta_{jj} = 0$.
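The equivalence of the three parameterizations can be illustrated numerically. The sketch below computes the parameters of models (4), (5) and (6) from a set of expected genotypic values using the relationships in Table 1, then evaluates each model at every genotype. The genotypic values are arbitrary illustrative numbers, and the function and key names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations_with_replacement


def table1_parameters(G, m):
    """Parameters of models (4), (5), (6) from genotypic values via Table 1.

    G maps sorted allele pairs (j, k), j <= k, to expected genotypic values.
    """
    half = Fraction(1, 2)
    Gmm = G[(m, m)]
    allele = {"mu": Gmm}
    finf = {"tau": Gmm + half * sum(G[(j, j)] - Gmm for j in range(1, m))}
    count = {"pi0": Gmm}
    for j in range(1, m):
        Gjj, Gjm = G[(j, j)], G[(j, m)]
        allele["alpha", j] = Gjm - Gmm
        allele["delta", j, j] = Gjj + Gmm - 2 * Gjm
        finf["a", j] = half * (Gjj - Gmm)
        finf["d", j, j] = Gjm - half * (Gjj + Gmm)
        count["pi", j] = Gjm - Gmm
        count["eta", j, j] = Gjj - Gmm
        for k in range(j + 1, m):
            inter = (G[(j, k)] - Gjm) - (G[(k, m)] - Gmm)
            allele["delta", j, k] = finf["d", j, k] = count["eta", j, k] = inter
    return allele, finf, count


def reconstruct(G, m, g):
    """Evaluate models (4), (5), (6) at genotype g with Table 1 parameters."""
    allele, finf, count = table1_parameters(G, m)
    w = {j: g.count(j) for j in range(1, m)}         # copies of allele j
    het = {j: int(w[j] == 1) for j in range(1, m)}   # h_j = h_1j
    hom = {j: int(w[j] == 2) for j in range(1, m)}   # v_jj = h_2j
    y4, y5, y6 = allele["mu"], finf["tau"], count["pi0"]
    for j in range(1, m):
        y4 += allele["alpha", j] * w[j] + allele["delta", j, j] * hom[j]
        y5 += finf["a", j] * (w[j] - 1) + finf["d", j, j] * het[j]
        y6 += count["pi", j] * het[j] + count["eta", j, j] * hom[j]
        for k in range(j + 1, m):
            pair = het[j] * het[k]                   # v_jk = h_j h_k, j != k
            y4 += allele["delta", j, k] * pair
            y5 += finf["d", j, k] * pair
            y6 += count["eta", j, k] * pair
    return y4, y5, y6


# Arbitrary illustrative genotypic values for a three-allele locus.
G = {(1, 1): 10, (2, 2): 7, (3, 3): 4, (1, 2): 9, (1, 3): 6, (2, 3): 5}
for g in combinations_with_replacement(range(1, 4), 2):
    print(g, G[g], reconstruct(G, 3, g))
```

Exact rational arithmetic (`fractions.Fraction`) is used so that the three reconstructed values can be compared with the input values without rounding error.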

For a biallelic locus with alleles $A$ (or $A_1$) and $a$ (or $A_2$), we have $m = 2$ with
three possible genotypic values $G_{AA} = E(G \mid AA)$, $G_{Aa} = E(G \mid Aa)$ and
$G_{aa} = E(G \mid aa)$. If we adopt the allele coding, then $w_2(g) = 2 - w_1(g)$,
$v_{12}(g) = w_1(g) - 2v_{11}(g)$, and $v_{22}(g) = 1 - w_1(g) + v_{11}(g)$. For the F∞
coding, we have $f_2(g) = -f_1(g)$ and $h_2(g) = h_1(g)$. So we can further drop $d_2$ in model
(5). For the allele-count coding, we have $h_{12}(g) = h_{11}(g)$ and
$h_{22}(g) = 1 - h_{11}(g) - h_{21}(g)$. The interpretations of the model parameters for these
three biallelic QTL models are summarized in Table 2, which is a special case of Table 1.

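For a locus with three alleles $A_1$, $A_2$, $A_3$ (i.e., $m = 3$), we have six possibly
distinct expected genotypic values $G_{11}$, $G_{22}$, $G_{33}$, $G_{12}$, $G_{13}$ and
$G_{23}$. Each of the three fully parameterized models (4), (5) and (6) can provide a full
re-parameterization of the six expected genotypic values. In matrix form, from the allele
coding model (4), we have

$$\begin{bmatrix} G_{11} \\ G_{22} \\ G_{33} \\ G_{12} \\ G_{13} \\ G_{23} \end{bmatrix} = \begin{bmatrix} 1 & 2 & 0 & 1 & 0 & 0 \\ 1 & 0 & 2 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 & 0 & 0 \\ 1 & 1 & 1 & 0 & 0 & 1 \\ 1 & 1 & 0 & 0 & 0 & 0 \\ 1 & 0 & 1 & 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} \mu \\ \alpha_1 \\ \alpha_2 \\ \delta_{11} \\ \delta_{22} \\ \delta_{12} \end{bmatrix}$$

From the F∞ coding model (5), we have

$$\begin{bmatrix} G_{11} \\ G_{22} \\ G_{33} \\ G_{12} \\ G_{13} \\ G_{23} \end{bmatrix} = \begin{bmatrix} 1 & 1 & -1 & 0 & 0 & 0 \\ 1 & -1 & 1 & 0 & 0 & 0 \\ 1 & -1 & -1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 1 & 1 & 1 \\ 1 & 0 & -1 & 1 & 0 & 0 \\ 1 & -1 & 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} \tau \\ a_1 \\ a_2 \\ d_{11} \\ d_{22} \\ d_{12} \end{bmatrix}$$

And the allele-count coding model (6) gives

$$\begin{bmatrix} G_{11} \\ G_{22} \\ G_{33} \\ G_{12} \\ G_{13} \\ G_{23} \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 & 0 & 0 \\ 1 & 1 & 1 & 0 & 0 & 1 \\ 1 & 1 & 0 & 0 & 0 & 0 \\ 1 & 0 & 1 & 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} \pi_0 \\ \pi_1 \\ \pi_2 \\ \eta_{11} \\ \eta_{22} \\ \eta_{12} \end{bmatrix}$$

By multiplying both sides of these equations on the left by the inverses of the design
matrices, we can show that the model parameters and the expected genotypic values have the
relationships summarized in Table 3, which are consistent with those in Table 1.

Table 1 Parameterization of fully parameterized one-locus models (4), (5), (6).

Coding         Relationships
Allele         $\mu = G_{mm}$, $\alpha_j = G_{jm} - G_{mm}$,
               $\delta_{jj} = G_{jj} + G_{mm} - 2G_{jm}$, $j = 1, \ldots, m-1$
               $\delta_{jk} = (G_{jk} - G_{jm}) - (G_{km} - G_{mm})$, $j, k = 1, \ldots, m-1$; $j < k$
F∞             $\tau = G_{mm} + \frac{1}{2} \sum_{j=1}^{m-1} (G_{jj} - G_{mm})$
               $a_j = \frac{G_{jj} - G_{mm}}{2}$, $d_{jj} = G_{jm} - \frac{G_{jj} + G_{mm}}{2}$, $j = 1, \ldots, m-1$
               $d_{jk} = (G_{jk} - G_{jm}) - (G_{km} - G_{mm})$, $j, k = 1, \ldots, m-1$; $j < k$
Allele-count   $\pi_0 = G_{mm}$, $\pi_j = G_{jm} - G_{mm}$,
               $\eta_{jj} = G_{jj} - G_{mm}$, $j = 1, \ldots, m-1$
               $\eta_{jk} = (G_{jk} - G_{jm}) - (G_{km} - G_{mm})$, $j, k = 1, \ldots, m-1$; $j < k$

Table 2 Parameterization of one-locus models (4), (5), (6) when m = 2.

Coding         Model                                      Relationships
Allele         $G_{AA} = \mu + 2\alpha_1 + \delta_{11}$   $\mu = G_{aa}$
               $G_{Aa} = \mu + \alpha_1$                  $\alpha_1 = G_{Aa} - G_{aa}$
               $G_{aa} = \mu$                             $\delta_{11} = G_{AA} + G_{aa} - 2G_{Aa}$
F∞             $G_{AA} = \tau + a_1$                      $\tau = \frac{G_{AA} + G_{aa}}{2}$
               $G_{Aa} = \tau + d_{11}$                   $a_1 = \frac{G_{AA} - G_{aa}}{2}$
               $G_{aa} = \tau - a_1$                      $d_{11} = G_{Aa} - \frac{G_{AA} + G_{aa}}{2}$
Allele-count   $G_{AA} = \pi_0 + \eta_{11}$               $\pi_0 = G_{aa}$
               $G_{Aa} = \pi_0 + \pi_1$                   $\pi_1 = G_{Aa} - G_{aa}$
               $G_{aa} = \pi_0$                           $\eta_{11} = G_{AA} - G_{aa}$

Table 3 Parameterization of one-locus models (4), (5), (6) when m = 3.

Coding         Relationships
Allele         $\mu = G_{33}$
               $\alpha_1 = G_{13} - G_{33}$, $\alpha_2 = G_{23} - G_{33}$
               $\delta_{11} = G_{11} + G_{33} - 2G_{13}$
               $\delta_{22} = G_{22} + G_{33} - 2G_{23}$
               $\delta_{12} = G_{12} + G_{33} - G_{13} - G_{23}$
F∞             $\tau = \frac{G_{11} + G_{22}}{2}$
               $a_1 = \frac{G_{11} - G_{33}}{2}$, $a_2 = \frac{G_{22} - G_{33}}{2}$
               $d_{11} = G_{13} - \frac{G_{11} + G_{33}}{2}$
               $d_{22} = G_{23} - \frac{G_{22} + G_{33}}{2}$
               $d_{12} = G_{12} + G_{33} - G_{13} - G_{23}$
Allele-count   $\pi_0 = G_{33}$
               $\pi_1 = G_{13} - G_{33}$, $\pi_2 = G_{23} - G_{33}$
               $\eta_{11} = G_{11} - G_{33}$
               $\eta_{22} = G_{22} - G_{33}$
               $\eta_{12} = G_{12} + G_{33} - G_{13} - G_{23}$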
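For the three-allele case, the parameters can also be obtained by solving the matrix equations directly. The following sketch solves each system by exact Gauss-Jordan elimination for one arbitrary set of genotypic values; the solver, the variable names and the numerical values are illustrative assumptions, with rows ordered as G11, G22, G33, G12, G13, G23.

```python
from fractions import Fraction


def solve(A, b):
    """Solve the square system A x = b by Gauss-Jordan elimination."""
    n = len(A)
    M = [[Fraction(x) for x in row] + [Fraction(y)] for row, y in zip(A, b)]
    for col in range(n):
        # Find a row with a nonzero entry in this column and move it up.
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        scale = M[col][col]
        M[col] = [x / scale for x in M[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [row[-1] for row in M]


# Design matrices of models (4), (5), (6) for m = 3.
X_ALLELE = [[1, 2, 0, 1, 0, 0], [1, 0, 2, 0, 1, 0], [1, 0, 0, 0, 0, 0],
            [1, 1, 1, 0, 0, 1], [1, 1, 0, 0, 0, 0], [1, 0, 1, 0, 0, 0]]
X_FINF = [[1, 1, -1, 0, 0, 0], [1, -1, 1, 0, 0, 0], [1, -1, -1, 0, 0, 0],
          [1, 0, 0, 1, 1, 1], [1, 0, -1, 1, 0, 0], [1, -1, 0, 0, 1, 0]]
X_COUNT = [[1, 0, 0, 1, 0, 0], [1, 0, 0, 0, 1, 0], [1, 0, 0, 0, 0, 0],
           [1, 1, 1, 0, 0, 1], [1, 1, 0, 0, 0, 0], [1, 0, 1, 0, 0, 0]]

# Arbitrary illustrative values for G11, G22, G33, G12, G13, G23.
G = [10, 7, 4, 9, 6, 5]

allele_params = solve(X_ALLELE, G)  # mu, alpha1, alpha2, delta11, delta22, delta12
finf_params = solve(X_FINF, G)      # tau, a1, a2, d11, d22, d12
count_params = solve(X_COUNT, G)    # pi0, pi1, pi2, eta11, eta22, eta12
print(allele_params, finf_params, count_params)
```

For these values, the solutions agree with the closed-form expressions in Table 3; for instance, the F∞ intercept equals (G11 + G22)/2.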

Reduced one-locus models 

Due to limited available sample sizes in practice, it may not always be feasible to use the
fully parameterized models. Quite often, one may want to check the main effects of alleles
first before including all possible allelic interactions. Here we consider the case of
including possible interactions between $A_j$ and itself for the homozygous genotypes
$A_jA_j$, $j = 1, \ldots, m$, but ignoring other interactions between different alleles $A_j$
and $A_k$ ($j \neq k$). Then we obtain a reduced case of model (2) as below

$$G_{jk} = \mu^* + \alpha_j^* + \alpha_k^* + \delta_j^* I_{\{j = k\}} \qquad (7)$$

for $j, k = 1, \ldots, m$, where $I_{\{j = k\}}$ equals 1 when $j = k$ and 0 otherwise.
Similarly, using the allele coding, we can present this model in a linear model form as

$$G(g_i) = \mu^* + \sum_{j=1}^{m} \alpha_j^* w_j(g_i) + \sum_{j=1}^{m} \delta_j^* v_j(g_i) \qquad (8)$$

for $i = 1, \ldots, N$, where $v_j(g) = v_{jj}(g)$ for $j = 1, \ldots, m$, with $v_{jj}(g)$
defined as before.

Model (8) contains only one redundant parameter in the $\alpha^*$'s due to the fact that
$\sum_{j=1}^{m} w_j(g_i) = 2$ for $i = 1, \ldots, N$. In this case, as shown in Appendix A, the
parameters $\delta_1^*, \ldots, \delta_m^*$ in model (8) are estimable but the parameters
$\mu^*$ and $\alpha_1^*, \ldots, \alpha_m^*$ are not estimable. To overcome the redundant
parameter problem, we can drop $w_m$ from model (8) and consider

$$G(g_i) = \mu + \sum_{j=1}^{m-1} \alpha_j w_j(g_i) + \sum_{j=1}^{m} \delta_j v_j(g_i) \qquad (9)$$

for $i = 1, \ldots, N$. Note that
$v_m = z_{1m} z_{2m} = 1 - \sum_{j=1}^{m-1} w_j + \sum_{j=1}^{m-1} \sum_{k=j}^{m-1} v_{jk}$,
which cannot be completely determined by $\{w_j, v_j, j = 1, \ldots, m-1\}$. Therefore,
dropping $\{\delta_{jk}, j, k = 1, \ldots, m-1, j < k\}$ from model (4) does not directly lead
to a model equivalent to (9), as the latter contains $v_m$. In fact, further dropping $v_m$ in
(9) leads to a more restricted model structure for the expected genotypic values, with an
interpretation of its model parameters similar to that presented for model (4). It is also
interesting to see that the haplotype coding proposed in [13] is a special case of model (9)
when we further ignore all the allelic interactions and drop all the
$\{v_j, j = 1, \ldots, m\}$ in the model.
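The single redundancy in model (8), and its removal in model (9), can be illustrated by comparing the number of columns of each design matrix with its rank over all genotypes. The sketch below does this for a four-allele locus with exact arithmetic; the helper names `rank` and `design` are hypothetical.

```python
from fractions import Fraction
from itertools import combinations_with_replacement


def rank(rows):
    """Rank of a matrix by exact row reduction to echelon form."""
    M = [[Fraction(x) for x in row] for row in rows]
    r = 0
    for col in range(len(M[0])):
        pivot = next((i for i in range(r, len(M)) if M[i][col] != 0), None)
        if pivot is None:
            continue  # no pivot in this column: it depends on earlier ones
        M[r], M[pivot] = M[pivot], M[r]
        for i in range(r + 1, len(M)):
            factor = M[i][col] / M[r][col]
            M[i] = [x - factor * y for x, y in zip(M[i], M[r])]
        r += 1
    return r


def design(m, n_w):
    """One row per genotype: intercept, w_1..w_{n_w}, v_1..v_m.

    n_w = m gives the columns of model (8); n_w = m - 1 gives model (9).
    """
    rows = []
    for g in combinations_with_replacement(range(1, m + 1), 2):
        w = [g.count(j) for j in range(1, n_w + 1)]
        v = [int(g == (j, j)) for j in range(1, m + 1)]
        rows.append([1] + w + v)
    return rows


m = 4
X8, X9 = design(m, m), design(m, m - 1)
print(len(X8[0]), rank(X8))  # columns vs. rank for model (8)
print(len(X9[0]), rank(X9))  # columns vs. rank for model (9)
```

In this example the model (8) design has one more column than its rank, while the model (9) design has full column rank, in line with the statement that dropping $w_m$ removes the only redundancy.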

By definition, a reduced model can be derived from its original model by adding certain
restrictions on the model parameters. Typically, the model parameters in a reduced model can be
interpreted similarly to those in its original model when these restrictions are simple enough
(e.g., setting a subset of them to zero). When the restrictions on the original model
parameters are complicated, however, the interpretation of the reduced model parameters can
differ from that in the original model. For model (9), we can establish the relationship
between its model parameters and the expected genotypic values using a classical matrix
approach, as shown in Appendix B. An alternative way of building this relationship is to simply
treat model (9) as a reduced form of model (8) by adding the restriction $\alpha_m^* = 0$ and
taking $\mu = \mu^*$, $\alpha_j = \alpha_j^*$ for $j = 1, \ldots, m-1$, and
$\delta_j = \delta_j^*$ for $j = 1, \ldots, m$. Note that adding the restriction
$\alpha_m^* = 0$ on (8) does not change the modeling structure of the expected genotypic values
because $\alpha_m^*$ is a redundant parameter given the others. Therefore,

$$\mu = \mu^* = G_{mm} - \delta_m = G_{jm} + G_{km} - G_{jk}, \quad j \neq k \neq m$$

$$\alpha_j = \alpha_j^* = G_{jm} - \mu^* = G_{jk} - G_{km}, \quad k \neq j, m; \ j = 1, \ldots, m-1$$

$$\delta_j = \delta_j^* = G_{jj} - (\mu^* + 2\alpha_j^*) = (G_{jj} - G_{jk}) - (G_{jl} - G_{kl}), \quad l \neq j, k; \ j = 1, \ldots, m$$

Comparing with the parameters in model (4), we can see that the interpretation of the
parameters in model (9) has changed slightly. The intercept $\mu$ now becomes
$(G_{mm} - \delta_m)$ instead of $G_{mm}$, and $\alpha_j$ is the substitution effect of
replacing allele $A_m$ by $A_j$ when paired with any allele $A_k$ ($k \neq j, m$) instead of
just $A_m$, while $\delta_j$ is the difference between the substitution effect of replacing any
allele $A_k$ by $A_j$ when paired with $A_j$ itself and that when paired with another allele
$A_l$ ($l \neq j, k$). If both $\alpha_j$ and $\delta_j$ are zero for a particular $j < m$,
then $G_{jj} = G_{jm} = \mu$ and $G_{jk} = G_{km}$ for any $k \neq j, m$.
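The contrasts that define the parameters of model (9) can be checked numerically. The sketch below generates expected genotypic values from an arbitrary choice of parameters for a four-allele locus and then recovers the parameters from the genotypic values alone; the function names and numerical values are illustrative assumptions.

```python
from itertools import combinations_with_replacement


def genotypic_values(mu, alpha, delta):
    """Expected genotypic values under model (9).

    alpha: dict j -> alpha_j for j = 1..m-1 (the reference allele m has none);
    delta: dict j -> delta_j for j = 1..m.
    """
    m = len(delta)
    G = {}
    for j, k in combinations_with_replacement(range(1, m + 1), 2):
        value = mu + alpha.get(j, 0) + alpha.get(k, 0)
        if j == k:
            value += delta[j]
        G[(j, k)] = G[(k, j)] = value  # no imprinting: G_jk = G_kj
    return G


def recover(G, m):
    """Recover (mu, alpha, delta) from G using the contrasts in the text."""
    mu = G[(1, m)] + G[(2, m)] - G[(1, 2)]
    alpha = {}
    for j in range(1, m):
        k = next(a for a in range(1, m) if a != j)   # any k != j, m
        alpha[j] = G[(j, k)] - G[(k, m)]
    delta = {}
    for j in range(1, m + 1):
        k, l = [a for a in range(1, m + 1) if a != j][:2]  # k, l != j
        delta[j] = (G[(j, j)] - G[(j, k)]) - (G[(j, l)] - G[(k, l)])
    return mu, alpha, delta


# Arbitrary illustrative parameter values for m = 4.
mu, alpha, delta = 5, {1: 2, 2: -1, 3: 4}, {1: 3, 2: 1, 3: -2, 4: 6}
G = genotypic_values(mu, alpha, delta)
print(recover(G, 4))
```

The recovery requires at least three alleles, since each contrast involves alleles other than the one being examined.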

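Under the same model framework (8), the $F_\infty$ coding leads to the following model

$$G(g_i) = \tau + \sum_{j=1}^{m-1} a_j f_j(g_i) + \sum_{j=1}^{m} d_j h_j(g_i) \qquad (10)$$

for $i = 1, \ldots, N$. By applying the relationships $f_j(g) = w_j(g) - 1$ and $h_j(g) = w_j(g) - 2v_j(g)$ for $j = 1, \ldots, m$, we can show that for models (10) and (8) to be equivalent their model parameters must satisfy

$$
\begin{aligned}
\tau &= \mu^* + \sum_{j=1}^{m-1} \left(\alpha_j^* + \tfrac{1}{2}\delta_j^*\right), \\
a_j &= \alpha_j^* + \tfrac{1}{2}\delta_j^*, && j = 1, \ldots, m-1, \\
d_j &= -\tfrac{1}{2}\delta_j^*, && j = 1, \ldots, m, \\
\alpha_m^* + \tfrac{1}{2}\delta_m^* &= 0.
\end{aligned}
$$

In other words, model (10) leads to a restriction $2\alpha_m^* + \delta_m^* = 0$ on the parameters in model (8), which makes $\mu^* = G_{mm} - (2\alpha_m^* + \delta_m^*) = G_{mm}$, $\alpha_m^* = -\delta_m^*/2$ and $\alpha_j^* = G_{jm} - (\mu^* + \alpha_m^*) = G_{jm} - G_{mm} + \delta_m^*/2$, $j = 1, \ldots, m-1$. Thus,

$$
\begin{aligned}
\tau &= G_{mm} + \frac{1}{2}\sum_{j=1}^{m-1} (G_{jj} - G_{mm}), \\
a_j &= \frac{G_{jj} - G_{mm}}{2}, && j = 1, \ldots, m-1, \\
d_j &= -\frac{(G_{jj} - G_{jk}) - (G_{jl} - G_{kl})}{2}, && j \neq k \neq l; \; j = 1, \ldots, m.
\end{aligned}
$$

Now $d_j$ becomes a half of the difference between the substitution effect of replacing any allele $A_k$ by $A_j$ when paired with another $A_j$ and that when paired with an allele $A_l$ ($l \neq j, k$), which can no longer be referred to as a dominance effect.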
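
The parameter relationships between models (10) and (8) follow from a direct substitution of the coding identities. The sketch below assumes that the regression form of framework (8) is $G(g) = \mu^* + \sum_{j=1}^{m} \alpha_j^* w_j(g) + \sum_{j=1}^{m} \delta_j^* v_j(g)$; this is our reading of (8), not a quotation of it.

```latex
\begin{align*}
G(g) &= \tau + \sum_{j=1}^{m-1} a_j f_j(g) + \sum_{j=1}^{m} d_j h_j(g) \\
     &= \tau + \sum_{j=1}^{m-1} a_j \bigl[w_j(g) - 1\bigr]
             + \sum_{j=1}^{m} d_j \bigl[w_j(g) - 2 v_j(g)\bigr] \\
     &= \Bigl(\tau - \sum_{j=1}^{m-1} a_j\Bigr)
        + \sum_{j=1}^{m-1} (a_j + d_j)\, w_j(g) + d_m\, w_m(g)
        - 2 \sum_{j=1}^{m} d_j\, v_j(g).
\end{align*}
% Matching the coefficients of 1, w_j and v_j with the assumed form of (8):
\begin{align*}
\mu^* = \tau - \sum_{j=1}^{m-1} a_j, \qquad
\alpha_j^* = a_j + d_j \;\; (j < m), \qquad
\alpha_m^* = d_m, \qquad
\delta_j^* = -2 d_j \;\; (j \le m).
\end{align*}
```

Solving these for the parameters of model (10) gives $d_j = -\delta_j^*/2$, $a_j = \alpha_j^* + \delta_j^*/2$ and $\tau = \mu^* + \sum_{j<m}(\alpha_j^* + \delta_j^*/2)$, while $\alpha_m^* = d_m = -\delta_m^*/2$ is exactly the restriction $2\alpha_m^* + \delta_m^* = 0$.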

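With the allele-count coding, we can actually construct two equivalent models in this case:

$$G(g_i) = \pi_0 + \sum_{j=1}^{m-1} \pi_j h_j^{(1)}(g_i) + \sum_{j=1}^{m} \eta_j h_j^{(2)}(g_i) \qquad (11)$$

and

$$G(g_i) = \pi_0' + \sum_{j=1}^{m} \pi_j' h_j^{(1)}(g_i) + \sum_{j=1}^{m-1} \eta_j' h_j^{(2)}(g_i) \qquad (12)$$

for $i = 1, \ldots, N$. Similarly, we can show that model (11) can be treated as a reduced model by adding the restriction $\alpha_m^* = 0$ on the parameters in model (8), with the following relationships

$$
\begin{aligned}
\pi_0 &= \mu^* = G_{jm} + G_{km} - G_{jk}, && j \neq k \neq m, \\
\pi_j &= \alpha_j^* = G_{jk} - G_{km}, && k \neq j, m; \; j = 1, \ldots, m-1, \\
\eta_j &= 2\alpha_j^* + \delta_j^* = (G_{jj} - G_{jm}) + (G_{jk} - G_{km}), && k \neq j, m; \; j = 1, \ldots, m-1, \\
\eta_m &= \delta_m^* = (G_{mm} - G_{jm}) - (G_{km} - G_{jk}), && j \neq k \neq m.
\end{aligned}
$$

On the other hand, model (12) can be treated as a reduced model by adding the restriction $2\alpha_m^* + \delta_m^* = 0$ on the parameters in model (8), with the following relationships

$$
\begin{aligned}
\pi_0' &= \mu^* = G_{mm}, \\
\pi_j' &= \alpha_j^* = \frac{(G_{jm} - G_{mm}) + (G_{jk} - G_{km})}{2}, && k \neq j, m; \; j = 1, \ldots, m-1, \\
\pi_m' &= \alpha_m^* = -\frac{(G_{mm} - G_{jm}) - (G_{km} - G_{jk})}{2}, && j \neq k \neq m, \\
\eta_j' &= 2\alpha_j^* + \delta_j^* = G_{jj} - G_{mm}, && j = 1, \ldots, m-1.
\end{aligned}
$$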

While the effect $\eta_{jj}$ in model (6) is the difference between the two expected homozygous genotypic values $G_{jj}$ and $G_{mm}$, the effect $\eta_j$ in model (11) becomes the sum of the substitution effects of replacing allele $A_m$ by $A_j$ when paired with $A_j$ itself and when paired with another allele $A_k$ ($k \neq j, m$). It is also interesting to see that the definitions of the parameters in models (11) and (12) are quite different. A null hypothesis of $H_0: \pi_j' = \eta_j' = 0$ for a particular $j < m$ in model (12) implies that $G_{jj} = G_{mm}$ and $G_{jm} - G_{mm} = G_{km} - G_{jk}$ for any $k \neq j, m$, while the null hypothesis of $H_0: \pi_j = \eta_j = 0$ for a $j < m$ in model (11) implies that $G_{jj} = G_{jm}$ and $G_{jk} = G_{km}$ for any $k \neq j, m$, which has nothing to do with $G_{mm}$.

Under the same model framework (8), each of the above four models (9), (10), (11) and (12) contains $2m$ non-redundant parameters (including the intercept) to model the $m(m+1)/2$ expected genotypic values. When $m > 3$, we have $m(m+1)/2 > 2m$. Therefore, the model framework (7) enforces certain constraints on the $m(m+1)/2$ genotypic values. If $m = 3$, then each of the four models actually provides a full re-parameterization of the six expected genotypic values $G_{11}$, $G_{22}$, $G_{33}$, $G_{12}$, $G_{13}$ and $G_{23}$. The relationships between the parameters of the four models and the expected genotypic values are summarized in Table 4.

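Comparing Table 4 with Table 1, we can see that the definition of model parameters depends not only on the coding schemes of marker genotypes but also on the underlying framework for the structure of the expected genotypic values. From Table 4, it is also interesting to see that the null hypothesis of $H_0: \alpha_j = \delta_j = 0$ ($j < m$) in model (9) is equivalent to $\pi_j = \eta_j = 0$ in model (11),

Table 4 Parameterization of one-locus models (9), (10), (11), (12) when $m \geq 3$.

Coding: Allele. Restriction: $\alpha_m^* = 0$. Relationships:
    $\mu = \mu^* = (G_{jm} + G_{km}) - G_{jk}$, $j \neq k \neq m$
    $\alpha_j = \alpha_j^* = G_{jk} - G_{km}$, $j = 1, \ldots, m-1$; $k \neq j, m$
    $\delta_j = \delta_j^* = (G_{jj} - G_{jk}) - (G_{jl} - G_{kl})$, $j = 1, \ldots, m$; $j \neq k \neq l$

Coding: $F_\infty$. Restriction: $2\alpha_m^* + \delta_m^* = 0$. Relationships:
    $\tau = \mu^* + \frac{1}{2}\sum_{j=1}^{m-1}(2\alpha_j^* + \delta_j^*) = G_{mm} + \frac{1}{2}\sum_{j=1}^{m-1}(G_{jj} - G_{mm})$
    $a_j = \frac{1}{2}(2\alpha_j^* + \delta_j^*) = (G_{jj} - G_{mm})/2$, $j = 1, \ldots, m-1$
    $d_j = -\delta_j^*/2 = -[(G_{jj} - G_{jk}) - (G_{jl} - G_{kl})]/2$, $j = 1, \ldots, m$; $j \neq k \neq l$

Coding: Allele-count. Restriction: $\alpha_m^* = 0$. Relationships:
    $\pi_0 = \mu^* = (G_{jm} + G_{km}) - G_{jk}$, $j \neq k \neq m$
    $\pi_j = \alpha_j^* = G_{jk} - G_{km}$, $j = 1, \ldots, m-1$; $k \neq j, m$
    $\eta_j = 2\alpha_j^* + \delta_j^* = (G_{jj} - G_{jm}) + (G_{jk} - G_{km})$, $j = 1, \ldots, m-1$; $k \neq j, m$
    $\eta_m = \delta_m^* = (G_{mm} - G_{jm}) - (G_{km} - G_{jk})$, $j \neq k \neq m$

Coding: Allele-count. Restriction: $2\alpha_m^* + \delta_m^* = 0$. Relationships:
    $\pi_0' = \mu^* = G_{mm}$
    $\pi_j' = \alpha_j^* = [(G_{jm} - G_{mm}) + (G_{jk} - G_{km})]/2$, $j = 1, \ldots, m-1$; $k \neq j, m$
    $\pi_m' = \alpha_m^* = -[(G_{mm} - G_{jm}) - (G_{km} - G_{jk})]/2$, $j \neq k \neq m$
    $\eta_j' = 2\alpha_j^* + \delta_j^* = G_{jj} - G_{mm}$, $j = 1, \ldots, m-1$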


which implies $\alpha_j^* = \delta_j^* = 0$ in model (8) with the restriction $\alpha_m^* = 0$, or $G_{jk} = G_{km}$ for any $k = 1, \ldots, m-1$. On the other hand, the null hypothesis of $H_0: a_j = d_j = 0$ ($j < m$) in model (10) is equivalent to $\pi_j' = \eta_j' = 0$ in model (12), which implies $\alpha_j^* = \delta_j^* = 0$ in model (8) with the restriction $2\alpha_m^* + \delta_m^* = 0$, or $G_{jj} = G_{mm}$ and $G_{jj} - G_{jm} = G_{jk} - G_{km}$ for any $k \neq j, m$. In general, the two null hypotheses of $\alpha_j = \delta_j = 0$ and $a_j = d_j = 0$ may not always be equivalent. For example, when $m = 3$, similar to the three-allele models discussed in the previous section, we can show that the parameters of the four models and the expected genotypic values have the relationships shown in Table 5, which is a special case of Table 4. We can see from Table 5 that $\alpha_1 = \delta_1 = 0$ is equivalent to $\pi_1 = \eta_1 = 0$, which implies $G_{12} = G_{23}$ and $G_{11} = G_{13}$; while $a_1 = d_1 = 0$ is equivalent to $\pi_1' = \eta_1' = 0$, which implies $G_{11} = G_{33}$ and $G_{12} + G_{13} = G_{11} + G_{23}$. So, depending on the underlying true setting of the expected genotypic values, the null hypothesis of $\alpha_1 = \delta_1 = 0$ in model (9) could be different from that of $a_1 = d_1 = 0$ in model (10).
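
The three-allele relationships can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes that the framework behind models (9) and (10) is $G_{jk} = \mu + \alpha_j + \alpha_k + \delta_j 1_{\{j=k\}}$, borrows the six genotypic values from the first simulation example, and uses helper names (`g`, `allele_params`, `finf_params`, `predict_allele`, `predict_finf`) that are purely illustrative.

```python
# Six expected genotypic values for a three-allele locus (taken from the
# first simulation example); keys are unordered allele pairs.
G = {(1, 1): 10.0, (2, 2): 50.0, (3, 3): 42.0,
     (1, 2): 30.0, (1, 3): 36.0, (2, 3): 46.0}


def g(j, k):
    """Look up G_jk regardless of allele order."""
    return G[(min(j, k), max(j, k))]


def allele_params():
    """Allele coding with alpha*_3 = 0 (closed forms for m = 3)."""
    mu = g(1, 3) + g(2, 3) - g(1, 2)
    alpha = {1: g(1, 2) - g(2, 3), 2: g(1, 2) - g(1, 3), 3: 0.0}
    delta = {1: (g(1, 1) - g(1, 3)) - (g(1, 2) - g(2, 3)),
             2: (g(2, 2) - g(2, 3)) - (g(1, 2) - g(1, 3)),
             3: g(3, 3) + g(1, 2) - g(1, 3) - g(2, 3)}
    return mu, alpha, delta


def finf_params():
    """F-infinity coding with 2*alpha*_3 + delta*_3 = 0 (closed forms for m = 3)."""
    tau = (g(1, 1) + g(2, 2)) / 2
    a = {1: (g(1, 1) - g(3, 3)) / 2, 2: (g(2, 2) - g(3, 3)) / 2}
    d = {1: ((g(1, 2) + g(1, 3)) - (g(2, 3) + g(1, 1))) / 2,
         2: ((g(1, 2) + g(2, 3)) - (g(1, 3) + g(2, 2))) / 2,
         3: ((g(1, 3) + g(2, 3)) - (g(1, 2) + g(3, 3))) / 2}
    return tau, a, d


def predict_allele(j, k):
    """Assumed framework: G_jk = mu + alpha_j + alpha_k + delta_j * 1(j == k)."""
    mu, alpha, delta = allele_params()
    return mu + alpha[j] + alpha[k] + (delta[j] if j == k else 0.0)


def predict_finf(j, k):
    """Evaluate the F-infinity model with f_t = w_t - 1 and h_t = 1(w_t == 1)."""
    tau, a, d = finf_params()
    w = {t: (j == t) + (k == t) for t in (1, 2, 3)}  # copies of allele t
    f = {t: w[t] - 1 for t in (1, 2)}
    h = {t: 1 if w[t] == 1 else 0 for t in (1, 2, 3)}
    return (tau + sum(a[t] * f[t] for t in (1, 2))
            + sum(d[t] * h[t] for t in (1, 2, 3)))


# Both parameterizations must reproduce all six genotypic values.
for (j, k), value in G.items():
    assert abs(predict_allele(j, k) - value) < 1e-9
    assert abs(predict_finf(j, k) - value) < 1e-9

print("allele coding:", allele_params())
print("F-infinity coding:", finf_params())
```

With these values neither $\alpha_1 = \delta_1 = 0$ nor $a_1 = d_1 = 0$ holds, but the two sets of closed forms make it easy to try other genotypic values and see that one null hypothesis can hold without the other.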

Extension to two-locus models 

Table 5 Parameterization of one-locus models (9), (10), (11), (12) when $m = 3$.

Coding: Allele. Restriction: $\alpha_3^* = 0$. Relationships:
    $\mu = G_{13} + G_{23} - G_{12}$
    $\alpha_1 = G_{12} - G_{23}$, $\alpha_2 = G_{12} - G_{13}$
    $\delta_1 = (G_{11} - G_{13}) - (G_{12} - G_{23})$
    $\delta_2 = (G_{22} - G_{23}) - (G_{12} - G_{13})$
    $\delta_3 = G_{33} + G_{12} - G_{13} - G_{23}$

Coding: $F_\infty$. Restriction: $2\alpha_3^* + \delta_3^* = 0$. Relationships:
    $\tau = (G_{11} + G_{22})/2$
    $a_1 = (G_{11} - G_{33})/2$, $a_2 = (G_{22} - G_{33})/2$
    $d_1 = [(G_{12} + G_{13}) - (G_{23} + G_{11})]/2$
    $d_2 = [(G_{12} + G_{23}) - (G_{13} + G_{22})]/2$
    $d_3 = [(G_{13} + G_{23}) - (G_{12} + G_{33})]/2$

Coding: Allele-count. Restriction: $\alpha_3^* = 0$. Relationships:
    $\pi_0 = G_{13} + G_{23} - G_{12}$
    $\pi_1 = G_{12} - G_{23}$, $\pi_2 = G_{12} - G_{13}$
    $\eta_1 = (G_{11} - G_{13}) + (G_{12} - G_{23})$
    $\eta_2 = (G_{22} - G_{23}) + (G_{12} - G_{13})$
    $\eta_3 = G_{33} + G_{12} - G_{13} - G_{23}$

Coding: Allele-count. Restriction: $2\alpha_3^* + \delta_3^* = 0$. Relationships:
    $\pi_0' = G_{33}$
    $\pi_1' = [(G_{12} + G_{13}) - (G_{23} + G_{33})]/2$
    $\pi_2' = [(G_{12} + G_{23}) - (G_{13} + G_{33})]/2$
    $\pi_3' = [(G_{13} + G_{23}) - (G_{12} + G_{33})]/2$
    $\eta_1' = G_{11} - G_{33}$, $\eta_2' = G_{22} - G_{33}$

In this section, we further explore some extensions of the previous one-locus models to two-locus models. Consider two marker loci with alleles $A_{11}, \ldots, A_{1m_1}$ at locus 1 and alleles $A_{21}, \ldots, A_{2m_2}$ at locus 2, respectively. Without distinguishing the parental origins of the alleles, there are in total $m_1 m_2 (m_1 + 1)(m_2 + 1)/4$ possible distinct expected genotypic values: $G_{jkrs} = E(G \mid A_{1j}A_{1k}A_{2r}A_{2s})$ for $j, k = 1, \ldots, m_1$, $j \leq k$; and $r, s = 1, \ldots, m_2$, $r \leq s$. Using the allele coding, we introduce the following coding variables



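$$
w_{1j}(g) = \begin{cases} 2, & \text{if } g = A_{1j}A_{1j} \\ 1, & \text{if } g = A_{1j}A'_{1j} \\ 0, & \text{if } g = A'_{1j}A'_{1j} \end{cases}
\qquad
v_{1jk}(g) = \begin{cases} 1, & \text{if } g = A_{1j}A_{1k} \\ 0, & \text{otherwise} \end{cases}
$$

$j, k = 1, \ldots, m_1$, for marker genotypes at locus 1 and

$$
w_{2r}(g) = \begin{cases} 2, & \text{if } g = A_{2r}A_{2r} \\ 1, & \text{if } g = A_{2r}A'_{2r} \\ 0, & \text{if } g = A'_{2r}A'_{2r} \end{cases}
\qquad
v_{2rs}(g) = \begin{cases} 1, & \text{if } g = A_{2r}A_{2s} \\ 0, & \text{otherwise} \end{cases}
$$

$r, s = 1, \ldots, m_2$, for marker genotypes at locus 2, where $A'_{1j}$ (or $A'_{2r}$) denotes any other allele type except $A_{1j}$ (or $A_{2r}$) at locus 1 (or 2). A fully parameterized two-locus model for $G_{jkrs}$ can then be presented as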



$$
\begin{aligned}
G(g_i) = \mu &+ \sum_{j=1}^{m_1-1} \alpha_{1j} w_{1j} + \sum_{j=1}^{m_1-1}\sum_{k=j}^{m_1-1} \delta_{1jk} v_{1jk}
 + \sum_{r=1}^{m_2-1} \alpha_{2r} w_{2r} + \sum_{r=1}^{m_2-1}\sum_{s=r}^{m_2-1} \delta_{2rs} v_{2rs} \\
&+ \sum_{j=1}^{m_1-1}\sum_{r=1}^{m_2-1} (\alpha_{1j}\alpha_{2r})\, w_{1j} w_{2r}
 + \sum_{j=1}^{m_1-1}\sum_{r=1}^{m_2-1}\sum_{s=r}^{m_2-1} (\alpha_{1j}\delta_{2rs})\, w_{1j} v_{2rs} \\
&+ \sum_{j=1}^{m_1-1}\sum_{k=j}^{m_1-1}\sum_{r=1}^{m_2-1} (\delta_{1jk}\alpha_{2r})\, v_{1jk} w_{2r}
 + \sum_{j=1}^{m_1-1}\sum_{k=j}^{m_1-1}\sum_{r=1}^{m_2-1}\sum_{s=r}^{m_2-1} (\delta_{1jk}\delta_{2rs})\, v_{1jk} v_{2rs}
\end{aligned}
\qquad (13)
$$

for $i = 1, \ldots, N$, where each coding variable is evaluated at $g_i$. Similar to the one-locus models, we can establish the relationship between the model parameters and the expected genotypic values as shown in (C.1) of Appendix C. A nice property of this allele coding model is that a higher-order effect is simply the deviation of its corresponding expected genotypic value from an approximation based on the lower-order effects. Here the corresponding expected genotypic value of a marker effect is determined by the positions of the alleles that differ from the two reference alleles $A_{1m_1}$ and $A_{2m_2}$. So, starting from the lowest-order parameter $\mu$, it is straightforward to build the relationships between the model parameters and the expected genotypic values from the low-order effect parameters up to the high-order effect parameters.

For the $F_\infty$ coding, we can define the following coding variables for the genotypes at the two marker loci separately:

$$
f_{1j}(g) = \begin{cases} 1, & \text{if } g = A_{1j}A_{1j} \\ 0, & \text{if } g = A_{1j}A'_{1j} \\ -1, & \text{if } g = A'_{1j}A'_{1j} \end{cases}
\qquad
h_{1j}(g) = \begin{cases} 1, & \text{if } g = A_{1j}A'_{1j} \\ 0, & \text{otherwise} \end{cases}
$$

for $j = 1, \ldots, m_1$, and

$$
f_{2r}(g) = \begin{cases} 1, & \text{if } g = A_{2r}A_{2r} \\ 0, & \text{if } g = A_{2r}A'_{2r} \\ -1, & \text{if } g = A'_{2r}A'_{2r} \end{cases}
\qquad
h_{2r}(g) = \begin{cases} 1, & \text{if } g = A_{2r}A'_{2r} \\ 0, & \text{otherwise} \end{cases}
$$

for $r = 1, \ldots, m_2$. A fully parameterized two-locus model using this coding is then

$$
\begin{aligned}
G(g_i) = \tau &+ \sum_{j=1}^{m_1-1} a_{1j} f_{1j}(g_i) + \sum_{r=1}^{m_2-1} a_{2r} f_{2r}(g_i) \\
&+ \sum_{j=1}^{m_1-1}\sum_{k=j}^{m_1-1} d_{1jk} h_{1j}(g_i) h_{1k}(g_i)
 + \sum_{r=1}^{m_2-1}\sum_{s=r}^{m_2-1} d_{2rs} h_{2r}(g_i) h_{2s}(g_i) \\
&+ \sum_{j=1}^{m_1-1}\sum_{r=1}^{m_2-1} (a_{1j}a_{2r})\, f_{1j} f_{2r}
 + \sum_{j=1}^{m_1-1}\sum_{r=1}^{m_2-1}\sum_{s=r}^{m_2-1} (a_{1j}d_{2rs})\, f_{1j} h_{2r} h_{2s} \\
&+ \sum_{j=1}^{m_1-1}\sum_{k=j}^{m_1-1}\sum_{r=1}^{m_2-1} (d_{1jk}a_{2r})\, h_{1j} h_{1k} f_{2r}
 + \sum_{j=1}^{m_1-1}\sum_{k=j}^{m_1-1}\sum_{r=1}^{m_2-1}\sum_{s=r}^{m_2-1} (d_{1jk}d_{2rs})\, h_{1j} h_{1k} h_{2r} h_{2s}
\end{aligned}
\qquad (14)
$$

for $i = 1, \ldots, N$. Still, using the relationships $w_{1j} = 1 + f_{1j}$, $w_{2r} = 1 + f_{2r}$, $v_{1jj} = (1 + f_{1j} - h_{1j})/2$, $v_{2rr} = (1 + f_{2r} - h_{2r})/2$, $v_{1jk} = h_{1j}h_{1k}$ for $j \neq k$, and $v_{2rs} = h_{2r}h_{2s}$ for $r \neq s$ between the $F_\infty$ coding variables and the allele coding variables, we can establish the relationships between the model parameters and the expected genotypic values as shown in (C.2) of Appendix C. We can easily verify that the biallelic two-locus $F_\infty$ effects in [9] are a special case of our results with $m_1 = m_2 = 2$. It is also interesting to see that the interpretation of the model parameters in terms of the expected genotypic values becomes much more complicated than that in the previous allele coding model. When $m_1, m_2 > 2$, the low-order within-locus main effect $a_{1j}$ is a weighted combination of the differences $(G_{jjrr} - G_{m_1m_1rr})$, where $r = 1, \ldots, m_2$ refer to the various homozygous genotypes $A_{2r}A_{2r}$ at locus 2. The within-locus effect $d_{1jj}$ is a weighted combination of the allelic interactions $(G_{jjrr} - 2G_{jm_1rr} + G_{m_1m_1rr})$, $r = 1, \ldots, m_2$, at locus 1 with reference $A_{2r}A_{2r}$ at locus 2. Even the intercept $\tau$ of the model becomes a complex function of various homozygous genotypic values.

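Applying the allele-count coding, we can define

$$
h_{1j}^{(1)}(g) = \begin{cases} 1, & \text{if } g = A_{1j}A'_{1j} \\ 0, & \text{otherwise} \end{cases}
\qquad
h_{1j}^{(2)}(g) = \begin{cases} 1, & \text{if } g = A_{1j}A_{1j} \\ 0, & \text{otherwise} \end{cases}
$$

for $j = 1, \ldots, m_1$, and

$$
h_{2r}^{(1)}(g) = \begin{cases} 1, & \text{if } g = A_{2r}A'_{2r} \\ 0, & \text{otherwise} \end{cases}
\qquad
h_{2r}^{(2)}(g) = \begin{cases} 1, & \text{if } g = A_{2r}A_{2r} \\ 0, & \text{otherwise} \end{cases}
$$

for $r = 1, \ldots, m_2$. Another fully parameterized two-locus model for $G_{jkrs}$ can be written as

$$
\begin{aligned}
G(g_i) = \pi_0 &+ \sum_{j=1}^{m_1-1} \bigl[\pi_{1j} h_{1j}^{(1)} + \eta_{1jj} h_{1j}^{(2)}\bigr]
 + \sum_{r=1}^{m_2-1} \bigl[\pi_{2r} h_{2r}^{(1)} + \eta_{2rr} h_{2r}^{(2)}\bigr] \\
&+ \sum_{j=1}^{m_1-1}\sum_{k=j+1}^{m_1-1} \eta_{1jk} h_{1j}^{(1)} h_{1k}^{(1)}
 + \sum_{r=1}^{m_2-1}\sum_{s=r+1}^{m_2-1} \eta_{2rs} h_{2r}^{(1)} h_{2s}^{(1)} \\
&+ \sum_{j=1}^{m_1-1}\sum_{r=1}^{m_2-1} \bigl[(\pi_{1j}\pi_{2r}) h_{1j}^{(1)} h_{2r}^{(1)} + (\pi_{1j}\eta_{2rr}) h_{1j}^{(1)} h_{2r}^{(2)}
 + (\eta_{1jj}\pi_{2r}) h_{1j}^{(2)} h_{2r}^{(1)} + (\eta_{1jj}\eta_{2rr}) h_{1j}^{(2)} h_{2r}^{(2)}\bigr] \\
&+ \sum_{j=1}^{m_1-1}\sum_{r=1}^{m_2-1}\sum_{s=r+1}^{m_2-1} \bigl[(\pi_{1j}\eta_{2rs}) h_{1j}^{(1)} + (\eta_{1jj}\eta_{2rs}) h_{1j}^{(2)}\bigr] h_{2r}^{(1)} h_{2s}^{(1)} \\
&+ \sum_{j=1}^{m_1-1}\sum_{k=j+1}^{m_1-1}\sum_{r=1}^{m_2-1} h_{1j}^{(1)} h_{1k}^{(1)} \bigl[(\eta_{1jk}\pi_{2r}) h_{2r}^{(1)} + (\eta_{1jk}\eta_{2rr}) h_{2r}^{(2)}\bigr] \\
&+ \sum_{j=1}^{m_1-1}\sum_{k=j+1}^{m_1-1}\sum_{r=1}^{m_2-1}\sum_{s=r+1}^{m_2-1} (\eta_{1jk}\eta_{2rs})\, h_{1j}^{(1)} h_{1k}^{(1)} h_{2r}^{(1)} h_{2s}^{(1)}
\end{aligned}
\qquad (15)
$$

for $i = 1, \ldots, N$, where each coding variable is evaluated at $g_i$. In this case, the allele-count coding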
variables and the allele coding variables have the relationships $w_{1j} = h_{1j}^{(1)} + 2h_{1j}^{(2)}$, $w_{2r} = h_{2r}^{(1)} + 2h_{2r}^{(2)}$, $v_{1jj} = h_{1j}^{(2)}$, $v_{2rr} = h_{2r}^{(2)}$, $v_{1jk} = h_{1j}^{(1)}h_{1k}^{(1)}$ for $j \neq k$, and $v_{2rs} = h_{2r}^{(1)}h_{2s}^{(1)}$ for $r \neq s$. Through the equivalence of the two models (13) and (15), we can also construct relationships between the parameters in model (15) and the expected genotypic values as shown in (C.3) of Appendix C. We can see that the interpretation of the parameters in the allele-count coding model (15) is as simple as that in the allele coding model (13), with the same intercept $G_{m_1m_1m_2m_2}$. Besides, it seems that some parameters such as $(\eta_{1jj}\eta_{2rr})$, $(\eta_{1jk}\eta_{2rs})$ and $(\eta_{1jk}\eta_{2rr})$ have simpler relationships than the corresponding ones in the allele coding model (13).
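
The identities linking the three codings can be verified by enumerating every genotype at a single locus; the two-locus variables are the same definitions applied to each locus separately. The sketch below is illustrative only: the function names `codings` and `check_relationships` and the dictionary keys are hypothetical, not notation from the paper.

```python
from itertools import combinations, combinations_with_replacement


def codings(genotype, j):
    """Return the coding variables of allele j for an unordered genotype pair."""
    count = sum(1 for allele in genotype if allele == j)  # copies of A_j
    return {
        "w": count,                     # allele coding: 2, 1 or 0
        "v": 1 if count == 2 else 0,    # homozygote indicator v_jj
        "f": count - 1,                 # F-infinity coding: 1, 0 or -1
        "h": 1 if count == 1 else 0,    # F-infinity heterozygote indicator
        "h1": 1 if count == 1 else 0,   # allele-count: exactly one copy
        "h2": 1 if count == 2 else 0,   # allele-count: two copies
    }


def check_relationships(m):
    """Check the stated identities over all genotypes of an m-allele locus."""
    alleles = list(range(1, m + 1))
    for genotype in combinations_with_replacement(alleles, 2):
        c = {j: codings(genotype, j) for j in alleles}
        for j in alleles:
            assert c[j]["w"] == 1 + c[j]["f"]
            assert 2 * c[j]["v"] == 1 + c[j]["f"] - c[j]["h"]
            assert c[j]["w"] == c[j]["h1"] + 2 * c[j]["h2"]
            assert c[j]["v"] == c[j]["h2"]
        for j, k in combinations(alleles, 2):
            # v_jk is 1 only for the heterozygote A_j A_k
            v_jk = 1 if genotype == (j, k) else 0
            assert v_jk == c[j]["h"] * c[k]["h"]
            assert v_jk == c[j]["h1"] * c[k]["h1"]
    return True


print(check_relationships(4))
```

Because the identities hold genotype by genotype, any model written in one coding can be rewritten in another by substitution, which is how the parameter relationships in Appendix C are obtained.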

Finally, let us consider some reduced cases of the two-locus models. By ignoring locus-by-locus interactions (i.e., epistases), we have the following simplified two-locus model framework

$$G_{jkrs} = \mu^* + \alpha_{1j}^* + \alpha_{1k}^* + \delta_{1jk}^* + \alpha_{2r}^* + \alpha_{2s}^* + \delta_{2rs}^* \qquad (16)$$

for $j, k = 1, \ldots, m_1$ and $r, s = 1, \ldots, m_2$. If we further ignore the within-locus allelic interactions between different alleles, then another reduced two-locus model framework is

$$G_{jkrs} = \mu^* + \alpha_{1j}^* + \alpha_{1k}^* + \delta_{1j}^* 1_{\{j=k\}} + \alpha_{2r}^* + \alpha_{2s}^* + \delta_{2r}^* 1_{\{r=s\}}. \qquad (17)$$

Similar to the one-locus models, under each of the two reduced model frameworks we can construct the two-locus models from the three coding schemes. The relationships between the model parameters and the expected genotypic values under framework (16) are summarized in Table 6, which can be treated as an extension of Table 1 to the two-locus case. The relationships between the model parameters and the expected genotypic values under framework (17) are summarized in Table 7, which is a straightforward extension of Table 4. Further dropping $\delta_{1j}^*$ for $j = 1, \ldots, m_1$ and $\delta_{2r}^*$ for $r = 1, \ldots, m_2$ in (17) will lead to an additive model framework, whose model parameters can be interpreted similarly to those in Table 6. From Tables 6 and 7, we can see that both the allele and allele-count coding models keep the interpretation of their lower-order main effects similar to that in the previous fully parameterized case with epistases, while in the $F_\infty$ coding models the definition of the lower-order main effects varies depending on whether epistases are involved in the models.

As pointed out in [9], the genetic effects of a marker may have different interpretations depending upon whether the marker is fitted in a one-locus model or a two-locus model. From the linear model theory, the genetic effects of a marker in a one-locus model are

Table 6 Parameterization of two-locus models under model framework (16).

Coding: Allele. Relationships:
    $\mu = G_{m_1m_1m_2m_2}$
    $\alpha_{1j} = G_{jm_1m_2m_2} - G_{m_1m_1m_2m_2}$, $j = 1, \ldots, m_1 - 1$
    $\alpha_{2r} = G_{m_1m_1rm_2} - G_{m_1m_1m_2m_2}$, $r = 1, \ldots, m_2 - 1$
    $\delta_{1jk} = G_{jkm_2m_2} - G_{jm_1m_2m_2} - G_{km_1m_2m_2} + G_{m_1m_1m_2m_2}$, $j, k = 1, \ldots, m_1 - 1$; $j \leq k$
    $\delta_{2rs} = G_{m_1m_1rs} - G_{m_1m_1rm_2} - G_{m_1m_1sm_2} + G_{m_1m_1m_2m_2}$, $r, s = 1, \ldots, m_2 - 1$; $r \leq s$

Coding: $F_\infty$. Relationships:
    $\tau = G_{m_1m_1m_2m_2} + \sum_{j=1}^{m_1-1} a_{1j} + \sum_{r=1}^{m_2-1} a_{2r}$
    $a_{1j} = (G_{jjm_2m_2} - G_{m_1m_1m_2m_2})/2$, $j = 1, \ldots, m_1 - 1$
    $a_{2r} = (G_{m_1m_1rr} - G_{m_1m_1m_2m_2})/2$, $r = 1, \ldots, m_2 - 1$
    $d_{1jj} = G_{jm_1m_2m_2} - (G_{jjm_2m_2} + G_{m_1m_1m_2m_2})/2$, $j = 1, \ldots, m_1 - 1$
    $d_{1jk} = G_{jkm_2m_2} - G_{jm_1m_2m_2} - G_{km_1m_2m_2} + G_{m_1m_1m_2m_2}$, $j, k = 1, \ldots, m_1 - 1$; $j < k$
    $d_{2rr} = G_{m_1m_1rm_2} - (G_{m_1m_1rr} + G_{m_1m_1m_2m_2})/2$, $r = 1, \ldots, m_2 - 1$
    $d_{2rs} = G_{m_1m_1rs} - G_{m_1m_1rm_2} - G_{m_1m_1sm_2} + G_{m_1m_1m_2m_2}$, $r, s = 1, \ldots, m_2 - 1$; $r < s$

Coding: Allele-count. Relationships:
    $\pi_0 = G_{m_1m_1m_2m_2}$
    $\pi_{1j} = G_{jm_1m_2m_2} - G_{m_1m_1m_2m_2}$, $j = 1, \ldots, m_1 - 1$
    $\pi_{2r} = G_{m_1m_1rm_2} - G_{m_1m_1m_2m_2}$, $r = 1, \ldots, m_2 - 1$
    $\eta_{1jj} = G_{jjm_2m_2} - G_{m_1m_1m_2m_2}$, $j = 1, \ldots, m_1 - 1$
    $\eta_{1jk} = G_{jkm_2m_2} - G_{jm_1m_2m_2} - G_{km_1m_2m_2} + G_{m_1m_1m_2m_2}$, $j, k = 1, \ldots, m_1 - 1$; $j < k$
    $\eta_{2rr} = G_{m_1m_1rr} - G_{m_1m_1m_2m_2}$, $r = 1, \ldots, m_2 - 1$
    $\eta_{2rs} = G_{m_1m_1rs} - G_{m_1m_1rm_2} - G_{m_1m_1sm_2} + G_{m_1m_1m_2m_2}$, $r, s = 1, \ldots, m_2 - 1$; $r < s$



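defined based on the expected genotypic values of certain genotypes at this particular marker locus, with genotypes at the other marker loci being averaged out based on the joint genotype distribution. For instance, marker 1 in the two-locus setting above has its effects defined in a one-locus model based on the one-locus genotypic values $E(G_{jk}) = E(G_{jkrs} \mid A_{1j}A_{1k}) = \sum_{r \leq s} P(A_{2r}A_{2s} \mid A_{1j}A_{1k})\, G_{jkrs}$, which could depend on the LD between alleles at the two loci. When the same marker is fitted in a two-locus model, its effects are usually functions of the expected genotypic values with their joint genotypes taking certain reference alleles or genotypes at the other marker loci. So, in general, even without locus-by-locus interactions, a single marker's effects in a one-locus model could be different from those defined in a multi-locus model when the alleles at different loci are in linkage disequilibrium (LD). Consider a two-locus haploid model with alleles $A$, $a$ at locus 1 and $B$, $b$ at locus 2. If we ignore the locus-by-locus interaction, it is easy to show that the additive allelic effects are $\alpha_1 = G_{AB} - G_{aB} = G_{Ab} - G_{ab}$ and $\alpha_2 = G_{AB} - G_{Ab} = G_{aB} - G_{ab}$ at locus 1 and 2, respectively. In a one-locus model at locus 1, however, we can show that the locus has its additive allelic effect $\alpha_1^* = \alpha_1 + D\alpha_2/(p_A p_a)$, where $D = P_{AB} - p_A p_B$ is the LD between the two loci.

Table 7 Parameterization of two-locus models under model framework (17) when $m_1, m_2 \geq 3$.

Coding: Allele. Restrictions: $\alpha_{1m_1}^* = 0$, $\alpha_{2m_2}^* = 0$. Relationships:
    $\mu = G_{m_1m_1m_2m_2} - (G_{m_1m_1rs} - G_{jm_1rs} - G_{km_1rs} + G_{jkrs}) - (G_{jkm_2m_2} - G_{jkrm_2} - G_{jksm_2} + G_{jkrs})$, $j \neq k \neq m_1$; $r \neq s \neq m_2$
    $\alpha_{1j} = G_{jkrs} - G_{m_1krs}$, $j = 1, \ldots, m_1 - 1$; $k \neq j, m_1$
    $\alpha_{2r} = G_{jkrs} - G_{jkm_2s}$, $r = 1, \ldots, m_2 - 1$; $s \neq r, m_2$
    $\delta_{1j} = G_{jjrs} - G_{jkrs} - G_{jlrs} + G_{klrs}$, $j = 1, \ldots, m_1$; $j \neq k \neq l$
    $\delta_{2r} = G_{jkrr} - G_{jkrs} - G_{jkrt} + G_{jkst}$, $r = 1, \ldots, m_2$; $r \neq s \neq t$

Coding: $F_\infty$. Restrictions: $2\alpha_{1m_1}^* + \delta_{1m_1}^* = 0$, $2\alpha_{2m_2}^* + \delta_{2m_2}^* = 0$. Relationships:
    $\tau = G_{m_1m_1m_2m_2} + \frac{1}{2}\sum_{j=1}^{m_1-1}(G_{jjm_2m_2} - G_{m_1m_1m_2m_2}) + \frac{1}{2}\sum_{r=1}^{m_2-1}(G_{m_1m_1rr} - G_{m_1m_1m_2m_2})$
    $a_{1j} = (G_{jjrs} - G_{m_1m_1rs})/2$, $j = 1, \ldots, m_1 - 1$
    $a_{2r} = (G_{jkrr} - G_{jkm_2m_2})/2$, $r = 1, \ldots, m_2 - 1$
    $d_{1j} = -(G_{jjrs} - G_{jkrs} - G_{jlrs} + G_{klrs})/2$, $j = 1, \ldots, m_1$; $j \neq k \neq l$
    $d_{2r} = -(G_{jkrr} - G_{jkrs} - G_{jkrt} + G_{jkst})/2$, $r = 1, \ldots, m_2$; $r \neq s \neq t$

Coding: Allele-count. Restrictions: $\alpha_{1m_1}^* = 0$, $\alpha_{2m_2}^* = 0$. Relationships:
    $\pi_0 = G_{m_1m_1m_2m_2} - (G_{m_1m_1rs} - G_{jm_1rs} - G_{km_1rs} + G_{jkrs}) - (G_{jkm_2m_2} - G_{jkrm_2} - G_{jksm_2} + G_{jkrs})$, $j \neq k \neq m_1$; $r \neq s \neq m_2$
    $\pi_{1j} = G_{jkrs} - G_{m_1krs}$, $j = 1, \ldots, m_1 - 1$; $k \neq j, m_1$
    $\pi_{2r} = G_{jkrs} - G_{jkm_2s}$, $r = 1, \ldots, m_2 - 1$; $s \neq r, m_2$
    $\eta_{1j} = G_{jjrs} - G_{jm_1rs} - G_{km_1rs} + G_{jkrs}$, $j = 1, \ldots, m_1 - 1$; $k \neq j, m_1$
    $\eta_{1m_1} = G_{m_1m_1rs} - G_{jm_1rs} - G_{km_1rs} + G_{jkrs}$, $j \neq k \neq m_1$
    $\eta_{2r} = G_{jkrr} - G_{jkrm_2} - G_{jksm_2} + G_{jkrs}$, $r = 1, \ldots, m_2 - 1$; $s \neq r, m_2$
    $\eta_{2m_2} = G_{jkm_2m_2} - G_{jkrm_2} - G_{jksm_2} + G_{jkrs}$, $r \neq s \neq m_2$

Coding: Allele-count. Restrictions: $2\alpha_{1m_1}^* + \delta_{1m_1}^* = 0$, $2\alpha_{2m_2}^* + \delta_{2m_2}^* = 0$. Relationships:
    $\pi_0' = G_{m_1m_1m_2m_2}$
    $\pi_{1j}' = [(G_{jm_1rs} - G_{m_1m_1rs}) + (G_{jkrs} - G_{km_1rs})]/2$, $j = 1, \ldots, m_1 - 1$; $k \neq j, m_1$
    $\pi_{1m_1}' = -(G_{m_1m_1rs} - G_{jm_1rs} - G_{km_1rs} + G_{jkrs})/2$, $j \neq k \neq m_1$
    $\pi_{2r}' = [(G_{jkrm_2} - G_{jkm_2m_2}) + (G_{jkrs} - G_{jksm_2})]/2$, $r = 1, \ldots, m_2 - 1$; $s \neq r, m_2$
    $\pi_{2m_2}' = -(G_{jkm_2m_2} - G_{jkrm_2} - G_{jksm_2} + G_{jkrs})/2$, $r \neq s \neq m_2$
    $\eta_{1j}' = G_{jjrs} - G_{m_1m_1rs}$, $j = 1, \ldots, m_1 - 1$
    $\eta_{2r}' = G_{jkrr} - G_{jkm_2m_2}$, $r = 1, \ldots, m_2 - 1$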
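
The haploid relationship between the one-locus and two-locus additive effects at locus 1 can be checked numerically by averaging the additive haploid genotypic values over the alleles at locus 2. The function below is a hedged sketch: its name, its arguments and the numbers in the usage line are illustrative assumptions, not values from the paper.

```python
def one_locus_effect(p_A, p_B, D, alpha1, alpha2, G_ab=0.0):
    """Return (direct, formula) for the one-locus additive effect at locus 1.

    direct  : E(G | A) - E(G | a) computed from haplotype frequencies
    formula : alpha1 + D * alpha2 / (p_A * p_a)
    """
    p_a = 1.0 - p_A
    # Haplotype frequencies with LD coefficient D = P_AB - p_A * p_B.
    P = {"AB": p_A * p_B + D, "Ab": p_A * (1 - p_B) - D,
         "aB": p_a * p_B - D, "ab": p_a * (1 - p_B) + D}
    # Additive haploid genotypic values without locus-by-locus interaction.
    G = {"AB": G_ab + alpha1 + alpha2, "Ab": G_ab + alpha1,
         "aB": G_ab + alpha2, "ab": G_ab}
    mean_A = (P["AB"] * G["AB"] + P["Ab"] * G["Ab"]) / p_A
    mean_a = (P["aB"] * G["aB"] + P["ab"] * G["ab"]) / p_a
    direct = mean_A - mean_a
    formula = alpha1 + D * alpha2 / (p_A * p_a)
    return direct, formula


direct, formula = one_locus_effect(p_A=0.3, p_B=0.4, D=0.05,
                                   alpha1=2.0, alpha2=3.0, G_ab=1.0)
print(direct, formula)
```

Setting `D=0` returns `alpha1` for both quantities, which illustrates that the one-locus and two-locus additive effects coincide under linkage equilibrium.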

Simulation Examples 

We use some numerical examples to illustrate properties of the models we have discussed. First, we consider the same example discussed in [11] of a three-allele locus with allele frequencies $p_1 = 0.2$ for $A_1$, $p_2 = 0.3$ for $A_2$, and $p_3 = 0.5$ for $A_3$. The six genotypic values are $G_{11} = 10$, $G_{12} = 30$, $G_{22} = 50$, $G_{13} = 36$, $G_{23} = 46$ and $G_{33} = 42$. We adopt a similar strategy to specify the genotype frequencies as $P_{jj} = p_j^2 - D$ for $j = 1, 2, 3$ and $P_{jk} = 2p_jp_k + D$ for $j \neq k$, where $D$ is a measure of departure from Hardy-Weinberg equilibrium (HWE) for the three alleles at the locus and $D' < D < D^*$ with

$$D' = -\min_{j \neq k}\{2p_jp_k\} = -0.12 \quad \text{and} \quad D^* = \min_{j = 1, 2, 3}\{p_j^2\} = 0.04.$$

We consider two cases: i) $D = 0$ for HWE, and ii) $D = 0.02$ for Hardy-Weinberg disequilibrium (HWD). The phenotypic value of an individual is simulated as the sum of its true genotypic value and an environmental noise from $N(0, \sigma^2)$, where $\sigma^2$ is chosen to be either 0 or 288, with the latter corresponding to a 20% heritability level when $D = 0$. For each of the four configurations, we simulate 10,000 random samples with 1000 individuals each. For each random sample, we fit the three fully parameterized one-locus models (4), (5) and (6) under model framework (2) using the least squares approach and estimate the model parameters as well as the six genotypic values. The means and standard deviations (SD) of the least squares estimates (LSE) of the model parameters and the six genotypic values from the 10,000 random samples in fitting these three models are summarized in Table 8.
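
The genotype frequencies and the heritability level quoted for this setting can be reproduced directly from the stated allele frequencies and genotypic values. The following is a small sketch for checking the arithmetic; the function names are illustrative and it is not the simulation code used for Table 8.

```python
from itertools import combinations_with_replacement

p = {1: 0.2, 2: 0.3, 3: 0.5}                       # allele frequencies
G = {(1, 1): 10.0, (1, 2): 30.0, (2, 2): 50.0,     # genotypic values
     (1, 3): 36.0, (2, 3): 46.0, (3, 3): 42.0}


def genotype_frequencies(D):
    """P_jj = p_j^2 - D and P_jk = 2 p_j p_k + D for j != k."""
    freqs = {}
    for j, k in combinations_with_replacement(sorted(p), 2):
        if j == k:
            freqs[(j, k)] = p[j] ** 2 - D
        else:
            freqs[(j, k)] = 2 * p[j] * p[k] + D
    return freqs


def genotypic_variance(D):
    """Variance of the genotypic value under the given disequilibrium D."""
    P = genotype_frequencies(D)
    mean = sum(P[key] * G[key] for key in P)
    return sum(P[key] * (G[key] - mean) ** 2 for key in P)


def heritability(D, sigma2):
    """Broad-sense heritability: genotypic variance over total variance."""
    var_g = genotypic_variance(D)
    return var_g / (var_g + sigma2)


# Admissible range of D so that all genotype frequencies stay positive.
D_lower = -min(2 * p[j] * p[k] for j in p for k in p if j < k)
D_upper = min(p[j] ** 2 for j in p)

print(D_lower, D_upper)
print(genotype_frequencies(0.02))
print(heritability(0.0, 288.0))
```

Under HWE the genotypic variance is 72, so an environmental variance of 288 gives a heritability of 72/(72 + 288) = 0.2, in agreement with the text.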

As each of the three models provides a re-parameterization of the six genotypic values, for each random sample the three models always give exactly the same estimates of the six genotypic values and the residual variance, as expected, even though their model parameters are defined in different ways. As a result, under each configuration, the three models have the same means and SD for the LSE of the six genotypic values and the residual variance. Without environmental variation, each model can accurately estimate its model parameters and the six genotypic values for each random sample regardless of whether there is HWE or HWD. When there is environmental variation in the phenotypes, it is known that the least squares estimators of the model parameters are unbiased under either HWE or HWD. However, HWD may affect the variance of the least squares estimators of the model parameters and the six genotypic values. Note that the genotypic frequencies are $P_{11} = 0.04$, $P_{22} = 0.09$, $P_{33} = 0.25$, $P_{12} = 0.12$, $P_{13} = 0.20$ and $P_{23} = 0.30$ under HWE, while with $D = 0.02$ the genotypic frequencies become $P_{11} = 0.02$, $P_{22} = 0.07$, $P_{33} = 0.23$, $P_{12} = 0.14$, $P_{13} = 0.22$ and $P_{23} = 0.32$. So, under HWD, we tend to have more individuals carrying genotypes $A_1A_2$, $A_1A_3$, $A_2A_3$ but fewer individuals carrying genotypes $A_1A_1$, $A_2A_2$, $A_3A_3$ in the random samples than under HWE. Since the genotypic values have to be estimated, more individuals with a certain genotype in a random sample provide a better estimate of the corresponding genotypic value. This explains why under HWD the estimates of $G_{11}$, $G_{22}$ and $G_{33}$ have larger SD (or variances) than under HWE, and the estimates of $G_{12}$, $G_{13}$ and $G_{23}$ under HWD have smaller variances than under HWE.

As another example, let us consider the statistical modeling of two-locus genotypic values $G_{jkrs}$, where the first locus has three alleles $A_1$, $A_2$, $A_3$ and the second locus has two alleles $B_1$, $B_2$. Assume that the alleles at locus 1 have the same allele frequencies as in the previous example, i.e., $p_1 = 0.2$ for $A_1$, $p_2 = 0.3$ for $A_2$, and $p_3 = 0.5$ for $A_3$, while the two alleles at locus 2 have frequencies $q_1 = 0.2$ for $B_1$ and $q_2 = 0.8$ for $B_2$. The two-locus genotypic values $\mathbf{G}_2 = (G_{jkrs})$, $j, k = 1, 2, 3$; $r, s = 1, 2$, are given by

$$
\mathbf{G}_2 =
\begin{bmatrix}
G_{1111} & G_{1112} & G_{1122} \\
G_{2211} & G_{2212} & G_{2222} \\
G_{3311} & G_{3312} & G_{3322} \\
G_{1211} & G_{1212} & G_{1222} \\
G_{1311} & G_{1312} & G_{1322} \\
G_{2311} & G_{2312} & G_{2322}
\end{bmatrix}
=
\begin{bmatrix}
10 & 10.9 & 9.6 \\
50 & 50.3 & 49.9 \\
42 & 42.6 & 41.2 \\
30 & 30.5 & 29.6 \\
36 & 36.8 & 35.4 \\
46 & 46.7 & 45.2
\end{bmatrix}
$$

which are modified values from the previous one-locus model such that $G_{jk11} = G_{jk}$, $G_{jk12} = G_{jk} + e_{1jk}$ and $G_{jk22} = G_{jk} - e_{2jk}$, with $e_{1jk}$ and $e_{2jk}$ being some small positive fluctuations according to the genotypes $B_1B_2$ and $B_2B_2$ at locus 2. We assume Hardy-Weinberg equilibria at both loci and specify their haplotype frequencies as: $h_{11} = p_1q_1 - D_1$, $h_{12} = p_1q_2 + D_1$, $h_{21} = p_2q_1 + D_2$, $h_{22} = p_2q_2 - D_2$, $h_{31} = p_3q_1 + (D_1 - D_2)$, $h_{32} = p_3q_2 - (D_1 - D_2)$, where $D_1$ (and $D_2$) are the linkage disequilibria (LD) between alleles $A_1$ and $B_2$ (and $A_2$ and $B_1$) at the two loci. We consider two scenarios: i) $D_1 = D_2 = 0$ for linkage equilibrium (LE); and ii) $D_1 = 0$, $D_2 = 0.03$

Table 8 Means (SD) of LSE for three one-locus models (4), (5) and (6) when $m = 3$.

Allele coding
Setting                     $\mu$          $\alpha_1$     $\alpha_2$     $\delta_{11}$   $\delta_{22}$  $\delta_{12}$   $\sigma^2$
True                        42             -6             4              -20             0              -10
D = 0, sigma^2 = 0          42.00 (0.00)   -6.00 (0.00)   4.00 (0.00)    -20.00 (0.00)   0.00 (0.00)    -10.00 (0.00)   0.00 (0.00)
D = 0, sigma^2 = 288        41.99 (1.07)   -5.98 (1.61)   3.99 (1.44)    -20.06 (3.80)   0.02 (2.85)    -9.98 (2.42)    287.84 (12.91)
D = 0.02, sigma^2 = 0       42.00 (0.00)   -6.00 (0.00)   4.00 (0.00)    -20.00 (0.00)   0.00 (0.00)    -10.00 (0.00)   0.00 (0.00)
D = 0.02, sigma^2 = 288     41.98 (1.14)   -5.97 (1.60)   4.01 (1.46)    -20.07 (6.21)   0.03 (3.09)    -10.04 (2.31)   287.81 (12.91)

Setting                     $G_{11}$       $G_{22}$       $G_{33}$       $G_{12}$        $G_{13}$       $G_{23}$
True                        10             50             42             30              36             46
D = 0, sigma^2 = 0          10.00 (0.00)   50.00 (0.00)   42.00 (0.00)   30.00 (0.00)    36.00 (0.00)   46.00 (0.00)
D = 0, sigma^2 = 288        9.96 (2.73)    49.99 (1.79)   41.99 (1.07)   30.02 (1.55)    36.01 (1.20)   45.98 (0.98)
D = 0.02, sigma^2 = 0       10.00 (0.00)   50.00 (0.00)   42.00 (0.00)   30.00 (0.00)    36.00 (0.00)   46.00 (0.00)
D = 0.02, sigma^2 = 288     9.96 (5.66)    50.03 (2.21)   41.98 (1.14)   29.97 (1.38)    36.01 (1.12)   45.99 (0.93)

$F_\infty$ coding
Setting                     $\tau$         $a_1$          $a_2$          $d_{11}$        $d_{22}$       $d_{12}$        $\sigma^2$
True                        30             -16            4              10              0              -10
D = 0, sigma^2 = 0          30.00 (0.00)   -16.00 (0.00)  4.00 (0.00)    10.00 (0.00)    0.00 (0.00)    -10.00 (0.00)   0.00 (0.00)
D = 0, sigma^2 = 288        29.98 (1.64)   -16.01 (1.46)  4.00 (1.05)    10.03 (1.90)    -0.01 (1.42)   -9.98 (2.42)    287.84 (12.91)
D = 0.02, sigma^2 = 0       30.00 (0.00)   -16.00 (0.00)  4.00 (0.00)    10.00 (0.00)    0.00 (0.00)    -10.00 (0.00)   0.00 (0.00)
D = 0.02, sigma^2 = 288     29.99 (3.05)   -16.01 (2.88)  4.03 (1.25)    10.04 (3.10)    -0.01 (1.54)   -10.04 (2.31)   287.81 (12.91)

Setting                     $G_{11}$       $G_{22}$       $G_{33}$       $G_{12}$        $G_{13}$       $G_{23}$
True                        10             50             42             30              36             46
D = 0, sigma^2 = 0          10.00 (0.00)   50.00 (0.00)   42.00 (0.00)   30.00 (0.00)    36.00 (0.00)   46.00 (0.00)
D = 0, sigma^2 = 288        9.96 (2.73)    49.99 (1.79)   41.99 (1.07)   30.02 (1.55)    36.01 (1.20)   45.98 (0.98)
D = 0.02, sigma^2 = 0       10.00 (0.00)   50.00 (0.00)   42.00 (0.00)   30.00 (0.00)    36.00 (0.00)   46.00 (0.00)
D = 0.02, sigma^2 = 288     9.96 (5.66)    50.03 (2.21)   41.98 (1.14)   29.97 (1.38)    36.01 (1.12)   45.99 (0.93)

Allele-count coding
Setting                     $\pi_0$        $\pi_1$        $\pi_2$        $\eta_{11}$     $\eta_{22}$    $\eta_{12}$     $\sigma^2$
True                        42             -6             4              -32             8              -10
D = 0, sigma^2 = 0          42.00 (0.00)   -6.00 (0.00)   4.00 (0.00)    -32.00 (0.00)   8.00 (0.00)    -10.00 (0.00)   0.00 (0.00)
D = 0, sigma^2 = 288        41.99 (1.07)   -5.98 (1.61)   3.99 (1.44)    -32.03 (2.92)   8.00 (2.09)    -9.98 (2.42)    287.84 (12.91)
D = 0.02, sigma^2 = 0       42.00 (0.00)   -6.00 (0.00)   4.00 (0.00)    -32.00 (0.00)   8.00 (0.00)    -10.00 (0.00)   0.00 (0.00)
D = 0.02, sigma^2 = 288     41.98 (1.14)   -5.97 (1.60)   4.01 (1.46)    -32.02 (5.76)   8.05 (2.51)    -10.04 (2.31)   287.81 (12.91)

Setting                     $G_{11}$       $G_{22}$       $G_{33}$       $G_{12}$        $G_{13}$       $G_{23}$
True                        10             50             42             30              36             46
D = 0, sigma^2 = 0          10.00 (0.00)   50.00 (0.00)   42.00 (0.00)   30.00 (0.00)    36.00 (0.00)   46.00 (0.00)
D = 0, sigma^2 = 288        9.96 (2.73)    49.99 (1.79)   41.99 (1.07)   30.02 (1.55)    36.01 (1.20)   45.98 (0.98)
D = 0.02, sigma^2 = 0       10.00 (0.00)   50.00 (0.00)   42.00 (0.00)   30.00 (0.00)    36.00 (0.00)   46.00 (0.00)
D = 0.02, sigma^2 = 288     9.96 (5.66)    50.03 (2.21)   41.98 (1.14)   29.97 (1.38)    36.01 (1.12)   45.99 (0.93)


for LD. The phenotypic value of an individual is still simulated as the sum of its genotypic value and an environmental noise from $N(0, \sigma^2)$, where $\sigma^2$ is chosen to be either 0 or 286, with the latter corresponding to a 20% heritability level when $D_1 = D_2 = 0$. For each of the four configurations, we simulate 10,000 random samples with 1000 individuals each. For each random sample, we consider fitting models under three model frameworks: i) one-locus models (4), (5) and (6) at locus 1 under model framework (2); ii) two-locus models without epistases from the three coding schemes under model framework (16); iii) fully parameterized two-locus models (13), (14) and (15) with epistases. Still, for each random sample, the three coding models under the same model framework give exactly the same estimates of the 18 genotypic values, as expected (results not shown here). As a result, under each model framework, the three models have the same means and SD for the LSE of the 18 genotypic values and the residual variance, although the means and SD for the LSE of their model parameters are different. To compare the LSE of model parameters for models from the same coding under different model frameworks, we summarize in Table 9 the means and SD of the LSE of the model parameters from the 10,000 random samples in fitting the three allele-coding models: the one-locus

Table 9 Means (SD) of LSE for three allele-coding models regarding the two-locus genotypic values

One-locus model
Setting                                $\mu$          $\alpha_{11}$   $\alpha_{12}$  $\delta_{111}$   $\delta_{122}$  $\delta_{112}$  $\sigma^2$
True (D1 = D2 = 0)                     41.68          -5.81           4.03           -20.03           0.29            -10
D1 = D2 = 0, sigma^2 = 0               41.68 (0.04)   -5.81 (0.06)    4.03 (0.06)    -20.03 (0.14)    0.29 (0.09)     -10.00 (0.08)   0.37 (0.01)
D1 = D2 = 0, sigma^2 = 286             41.69 (1.07)   -5.83 (1.61)    4.03 (1.44)    -20.00 (3.79)    0.27 (2.85)     -9.99 (2.43)    286.58 (12.82)
True (D1 = 0, D2 = 0.03)               41.55          -5.74           4.21           -20.04           0.09            -10.06
D1 = 0, D2 = 0.03, sigma^2 = 0         41.55 (0.04)   -5.74 (0.06)    4.21 (0.06)    -20.04 (0.14)    0.09 (0.09)     -10.06 (0.08)   0.36 (0.01)
D1 = 0, D2 = 0.03, sigma^2 = 286       41.54 (1.07)   -5.74 (1.61)    4.23 (1.45)    -20.09 (3.81)    0.07 (2.83)     -10.09 (2.43)   286.27 (12.94)

Two-locus model, no epistases
Setting                                $\mu$          $\alpha_{11}$   $\alpha_{12}$  $\delta_{111}$   $\delta_{122}$  $\delta_{112}$  $\alpha_{21}$  $\delta_{211}$  $\sigma^2$
True (D1 = D2 = 0)                     41.88          -5.81           4.03           -20.03           0.29            -10             0.64           -1.92
D1 = D2 = 0, sigma^2 = 0               41.88 (0.02)   -5.81 (0.01)    4.03 (0.01)    -20.03 (0.01)    0.29 (0.05)     -10.00 (0.02)   0.64 (0.02)    -1.92 (0.03)    0.024 (0.002)
D1 = D2 = 0, sigma^2 = 286             41.91 (2.88)   -5.82 (1.61)    4.03 (1.44)    -19.99 (3.79)    0.27 (2.85)     -9.99 (2.43)    0.63 (2.90)    -1.92 (3.41)    285.64 (12.79)
True (D1 = 0, D2 = 0.03)               41.85          -5.80           4.06           -20.04           0.14            -10.04          0.65           -1.92
D1 = 0, D2 = 0.03, sigma^2 = 0         41.85 (0.02)   -5.80 (0.01)    4.06 (0.01)    -20.04 (0.01)    0.14 (0.05)     -10.04 (0.02)   0.65 (0.02)    -1.92 (0.03)    0.02 (0.00)
D1 = 0, D2 = 0.03, sigma^2 = 286       41.87 (2.94)   -5.80 (1.61)    4.07 (1.45)    -20.09 (3.81)    0.12 (2.83)     -10.07 (2.43)   0.62 (2.88)    -1.88 (3.38)    285.36 (12.94)

Two-locus model with epistases
Setting                                $\mu$          $\alpha_{11}$   $\alpha_{12}$  $\delta_{111}$   $\delta_{122}$  $\delta_{112}$  $\alpha_{21}$
True                                   42             -6              4              -20              0               -10             0.6
D1 = D2 = 0, sigma^2 = 0               42.00 (0.00)   -6.00 (0.00)    4.00 (0.00)    -20.00 (0.00)    0.00 (0.00)     -10.00 (0.00)   0.60 (0.00)
D1 = D2 = 0, sigma^2 = 286             41.92 (5.73)   -5.99 (8.65)    4.04 (7.67)    -19.79 (19.86)   -0.04 (15.62)   -9.82 (13.54)   0.66 (6.04)
D1 = 0, D2 = 0.03, sigma^2 = 0         42.00 (0.00)   -6.00 (0.00)    4.00 (0.00)    -20.00 (0.00)    0.00 (0.00)     -10.00 (0.00)   0.60 (0.00)
D1 = 0, D2 = 0.03, sigma^2 = 286       42.24 (8.60)   -6.11 (11.83)   3.66 (10.05)   -20.04 (22.95)   0.51 (14.85)    -9.77 (14.64)   0.38 (8.86)

Setting                                $\delta_{211}$  $(\alpha_{11}\alpha_{21})$  $(\alpha_{12}\alpha_{21})$  $(\delta_{111}\alpha_{21})$  $(\delta_{122}\alpha_{21})$  $(\delta_{112}\alpha_{21})$  $(\alpha_{11}\delta_{211})$
True                                   -2             0.2             0.1            -0.1             -0.5            -0.4            -0.2
D1 = D2 = 0, sigma^2 = 0               -2.00 (0.00)   0.20 (0.00)     0.10 (0.00)    -0.10 (0.00)     -0.50 (0.00)    -0.40 (0.00)    -0.20 (0.00)
D1 = D2 = 0, sigma^2 = 286             -2.05 (6.99)   0.23 (9.12)     0.07 (8.08)    -0.35 (20.98)    -0.47 (16.35)   -0.65 (14.30)   -0.29 (10.58)
D1 = 0, D2 = 0.03, sigma^2 = 0         -2.00 (0.00)   0.20 (0.00)     0.10 (0.00)    -0.10 (0.00)     -0.50 (0.00)    -0.40 (0.00)    -0.20 (0.00)
D1 = 0, D2 = 0.03, sigma^2 = 286       -1.80 (9.71)   0.24 (12.24)    0.39 (10.44)   -0.03 (24.03)    -0.94 (15.63)   -0.52 (15.27)   -0.15 (13.52)

Setting                                $(\alpha_{12}\delta_{211})$  $(\delta_{111}\delta_{211})$  $(\delta_{122}\delta_{211})$  $(\delta_{112}\delta_{211})$  $\sigma^2$
True                                   -0.2           0.2             1.7            1
D1 = D2 = 0, sigma^2 = 0               -0.20 (0.00)   0.20 (0.00)     1.70 (0.00)    1.00 (0.00)      0.00 (0.00)
D1 = D2 = 0, sigma^2 = 286             -0.18 (9.39)   0.55 (24.51)    1.70 (18.94)   1.35 (16.46)     282.45 (12.83)
D1 = 0, D2 = 0.03, sigma^2 = 0         -0.20 (0.00)   0.20 (0.00)     1.70 (0.00)    1.00 (0.00)      0.00 (0.00)
D1 = 0, D2 = 0.03, sigma^2 = 286       -0.44 (11.64)  0.07 (27.44)    2.11 (18.14)   0.98 (17.29)     282.81 (12.74)

model (4), the two-locus model under model framework (16), and the fully parameterized two-locus model (13). Models from the other two coding schemes behave similarly.

As we mentioned before, the one-locus models are actually modeling
the expected genotypic values given the genotypes at locus 1. When
$D_1 = D_2 = 0$, we can show that the expected genotypic values at
locus 1 are $G_{11} = 10.03$, $G_{22} = 50.03$, $G_{33} = 41.68$,
$G_{12} = 29.90$, $G_{13} = 35.87$ and $G_{23} = 45.71$, which
correspond to $\mu = 41.68$, $\alpha_{11} = -5.81$,
$\alpha_{12} = 4.03$, $\delta_{111} = -20.03$, $\delta_{122} = 0.29$
and $\delta_{112} = -10$ as the true parameters in the allele coding
one-locus model. When $D_1 = 0$, $D_2 = 0.03$, the expected genotypic
values at locus 1 become $G_{11} = 10.03$, $G_{22} = 50.08$,
$G_{33} = 41.55$, $G_{12} = 29.97$, $G_{13} = 35.81$ and
$G_{23} = 45.77$, which correspond to $\mu = 41.55$,
$\alpha_{11} = -5.74$, $\alpha_{12} = 4.21$, $\delta_{111} = -20.04$,
$\delta_{122} = 0.09$ and $\delta_{112} = -10.06$ as the true
parameters in the allele coding one-locus model. In both cases, the
least-squares estimators (LSE) of the one-locus model parameters are
unbiased estimators of the true parameters. Note that, unlike the
one-locus model in the previous example, the LSE of the model
parameters are no longer exactly the same as the true values even
when no environmental noise is involved. The reason is that the
expected genotypic values at locus 1 depend not only on the genotypic
values but also on the joint genotype frequencies in the sample,
which may change slightly from sample to sample due to sampling
variation.
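
The correspondence between the expected genotypic values and the allele coding parameters quoted above can be checked numerically. The following Python sketch assumes that $A_3$ is the reference allele, so that $\mu = G_{33}$, $\alpha_{1j} = G_{j3} - G_{33}$ and $\delta_{1jk} = G_{jk} - G_{j3} - G_{k3} + G_{33}$; the function name and the dictionary layout are illustrative choices, not part of the original analysis.

```python
def allele_coding_params(G, m=3):
    """Map expected genotypic values to allele coding parameters.

    G maps an unordered genotype (j, k) with j <= k to its expected
    genotypic value; allele m is taken as the reference allele
    (an assumption of this sketch).
    """
    def g(j, k):
        return G[(min(j, k), max(j, k))]

    mu = g(m, m)
    alpha = {j: g(j, m) - mu for j in range(1, m)}
    delta = {
        (j, k): g(j, k) - g(j, m) - g(k, m) + mu
        for j in range(1, m)
        for k in range(j, m)
    }
    return mu, alpha, delta


# Expected genotypic values at locus 1 quoted in the text for D1 = D2 = 0.
G_le = {(1, 1): 10.03, (2, 2): 50.03, (3, 3): 41.68,
        (1, 2): 29.90, (1, 3): 35.87, (2, 3): 45.71}

mu, alpha, delta = allele_coding_params(G_le)
print(round(mu, 2), {j: round(a, 2) for j, a in alpha.items()})
print({jk: round(d, 2) for jk, d in delta.items()})
```

Up to rounding, the values returned for this first scenario agree with the parameters listed in the text.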

The two-locus model without epistases cannot provide unbiased
estimators for all the genotypic values because of model
mis-specification. However, the LSE of its parameters associated with
locus 1 are similar to those in the one-locus model at locus 1. In
fact, as we know from linear model theory, the true values of its
parameters associated with locus 1 are the same as the ones defined
in the one-locus model at locus 1 when the two loci are in LE. Under
LD, the least-squares estimators of its model parameters associated
with locus 1 could be biased, and the bias depends on the LD setting.

The two-locus model with epistases gives a full re-parameterization
of the 18 genotypic values. Therefore, when no environmental noise is
involved, the LSE of its model parameters are exactly the same as
their true values for each random sample regardless of the LD between
the two loci. It has to be pointed out that this holds only when the
random sample contains all the 18 possible genotypes. In our
simulation setting, the frequencies of certain genotypes such as
$A_1A_1B_1B_1$, $A_1A_3B_1B_1$ and $A_2A_2B_1B_1$ are quite small. As
a result, we occasionally (in about 22-23% of the 1000 random
samples) obtain a random sample that has no individuals carrying
certain genotypes. In this case, the design matrix in the fully
parameterized model becomes singular and the LSE of the model
parameters are no longer unique. To keep our illustration of the
model properties simple, we excluded those random samples in fitting
the two-locus model with epistases (reduced models are less likely to
have singular design matrices). Other techniques such as ridge
regression could be applied to handle those skewed random samples. In
the presence of environmental noise, it is also noted that the LSE
for some of its model parameters such as $\delta_{111}$,
$(\delta_{111}\alpha_{21})$ and $(\delta_{111}\delta_{211})$ have
much larger SD than the LSE of other parameters. This is due to the
low frequencies of genotypes $A_1A_1B_1B_1$, $A_1A_3B_1B_1$ and
$A_2A_2B_1B_1$. As a random sample has few individuals carrying these
genotypes, it has reduced accuracy in estimation of their
corresponding true genotypic values, to which the model parameters
$\delta_{111}$, $(\delta_{111}\alpha_{21})$ and
$(\delta_{111}\delta_{211})$ are related.
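
The loss of uniqueness when a genotype is absent can be illustrated with a small rank check. The Python sketch below does not use the allele coding itself; it assumes a genotype-indicator (cell-means) design, which spans the same column space as any fully parameterized model of the 18 genotypic values. The genotype labels, sample construction and function name are hypothetical.

```python
import itertools

import numpy as np


def genotype_design(sample, genotypes):
    """One indicator column per possible joint genotype."""
    index = {g: i for i, g in enumerate(genotypes)}
    X = np.zeros((len(sample), len(genotypes)))
    for row, g in enumerate(sample):
        X[row, index[g]] = 1.0
    return X


locus1 = ["A1A1", "A1A2", "A1A3", "A2A2", "A2A3", "A3A3"]
locus2 = ["B1B1", "B1B2", "B2B2"]
genotypes = [a + b for a, b in itertools.product(locus1, locus2)]  # 18 cells

# A sample covering every genotype, and one lacking A1A1B1B1 carriers.
complete = genotypes * 3
skewed = [g for g in complete if g != "A1A1B1B1"]

rank_complete = np.linalg.matrix_rank(genotype_design(complete, genotypes))
rank_skewed = np.linalg.matrix_rank(genotype_design(skewed, genotypes))
print(rank_complete, rank_skewed)
```

When one genotype is unobserved, the rank falls below the number of parameters, so the least-squares solution is no longer unique.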

Discussion 

In this study, we introduced three genotype coding schemes to build
$F_\infty$ models for multi-allele markers. The relationships between
the model parameters and the expected genotypic values were
established in some fully parameterized as well as reduced one-locus
and two-locus $F_\infty$ models. Our results showed that these
relationships could become more intricate in the multi-allele case
than in the biallelic case, even though the extension of the coding
schemes from biallelic to multiple alleles appears straightforward.
We built the relationships between different model parameters mainly
through their coding variables of marker genotypes, which simplified
the tedious derivation process compared with the classical matrix
approach. The $F_\infty$ models we proposed can be used directly for
association testing of multi-allele markers and their possible
interactions with quantitative traits using random unrelated samples.
These $F_\infty$ models could also be applied to test for the risk
haplotypes and their interactions when incorporated with the
likelihood approach (e.g., [20]), or to analyze family data by
combining them with the likelihood to account for the transmission
probability of alleles from parents to their offspring. Although our
discussion focused on genetic modeling of quantitative traits, the
results can be extended to other phenotypic traits such as binary
outcomes in case-control studies using logistic regression models or
time-to-event data using Cox proportional hazards models.

Throughout the paper, we assumed that all the possi- 
ble genotypes are available from the sampled individuals. 
If certain genotypes are not observable, then the 
expected genotypic values on these genotypes will not 
be estimable by themselves, which could change the 
interpretation of the model parameters as well. The 
models we have presented can also be modified to han- 
dle the situation when some individuals have missing 
genotypes at certain marker loci. When the missing gen- 
otypes at a marker locus have both alleles missing at the 
same time, we can simply introduce an indicator vari- 
able to code for the missing genotype at the marker. 
The regression coefficient of this indicator variable for 
this missing genotype can usually be interpreted as the 
difference between the expected genotypic value with 
missing genotype at the marker locus and the intercept 
of the model, while the other regression coefficients 
would keep the same interpretation as before. 
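
The interpretation of the missing-genotype indicator can be seen in a small noise-free example. The Python sketch below assumes a biallelic marker with $A_2$ as the reference allele, coding variables $w_1$ (copies of $A_1$) and $v_{11}$ (indicator of $A_1A_1$), and an extra indicator $u$ for individuals whose genotype is missing; all group means are hypothetical.

```python
import numpy as np

# Columns: intercept, w1 (copies of A1), v11 (1 if A1A1), u (1 if missing).
rows = {
    "A2A2":    [1, 0, 0, 0],
    "A1A2":    [1, 1, 0, 0],
    "A1A1":    [1, 2, 1, 0],
    "missing": [1, 0, 0, 1],
}
# Hypothetical expected trait values for each group.
group_mean = {"A2A2": 10.0, "A1A2": 13.0, "A1A1": 12.0, "missing": 11.5}

# Five noise-free individuals per group.
X = np.array([rows[g] for g in rows for _ in range(5)], dtype=float)
y = np.array([group_mean[g] for g in rows for _ in range(5)])

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
mu, alpha, delta, gamma = coef
print(mu, alpha, delta, gamma)
```

In this sketch the coefficient of $u$ equals the mean of the missing-genotype group minus the intercept, while the other coefficients are the same as they would be without the missing group.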

It has to be pointed out that the relationships between the model
parameters and the expected genotypic values are based on the
assumption that the models correctly specify the structure of the
expected genotypic values. When a fully parameterized model is
applied, the definition of its model parameters does not depend on
the allele frequencies, HWD among alleles within a locus, or LD
structure between alleles at different loci. In fitting a reduced
model, however, the simplified model may not be totally correct in
modeling all the expected genotypic values. In this case, depending
on how accurately the simplified model approximates the expected
genotypic values, the allele frequencies, HWD and LD structure
between marker alleles could affect the definition and LSE of its
model parameters. In the presence of environmental variation in the
phenotypic values, regardless of whether a fully parameterized or
reduced model is applied, the allele frequencies, HWD or LD between
marker alleles may affect the LSE of the model parameters and the
power in detection of the associated marker alleles, as shown in our
simulation studies.

All the models we have discussed so far are $F_\infty$ models.
Statistically, these $F_\infty$ models are fixed-effect models which
focus on modeling the expected genotypic values directly. On the
other hand, Fisher's ANOVA models, which target evaluation of the
variations contributed by various allelic effects and interactions,
can be treated as random-effect models (see [21]) in which the
expected genotypic values come from a discrete random variable
$G(g) = E(G \mid g)$ with its limited genotypes $g$ being randomly
sampled from a study population. Both the $F_\infty$ and the Fisher
type models form a basis for the analysis of quantitative traits and
they provide different perspectives in assessing the genetic effects
of QTL and markers. For biallelic markers, we proposed in [10] a
'mean corrected' Fisher (mc-Fisher) model for decomposition of the
genotypic variances. In the multi-allele marker case, we can also
construct similar mc-Fisher models by applying mean corrections on
all the indicator variables of the paternal and maternal alleles in
the allele coding $F_\infty$ models. For example, based on the allele
coding model (4), we can construct its corresponding mc-Fisher model
by replacing the coding variables $w_j$ and $v_{jk}$ with
$\tilde{w}_j = w_j - 2p_j$ and
$s_{jk} = (z_{1j} - p_j)(z_{2k} - p_k) = v_{jk} - (p_j w_k + p_k w_j)/2 + p_j p_k$,
respectively, where $p_j$ is the allele frequency of $A_j$. Then the
genetic additive and dominance variance components $V_a$ and $V_d$ of
$G(g)$, which are defined as variations contributed by the additive
allelic effects and allelic interactions respectively, can be
estimated from the $\tilde{w}_j$'s and $s_{jk}$'s separately. As
pointed out in [10], the mc-Fisher model can provide an orthogonal
partition of $V(G)$ into the sum of $V_a$ and $V_d$ under
Hardy-Weinberg equilibrium, and it can be fitted through the standard
least-squares regression approach. Similar to the $F_\infty$ models,
the definition of the model parameters in such an mc-Fisher model
also depends on the choice of the reference allele $A_m$, but the
estimates of the additive and dominance variance components $V_a$ and
$V_d$ do not depend on such a choice. In addition, when a fully
parameterized model is applied, the mc-Fisher model is equivalent to
its original $F_\infty$ model in modeling the expected genotypic
values. Therefore, both models have the same residual variance and
the same F-statistic in testing for the overall effect of the marker
locus. When reduced models are applied, the mc-Fisher model may no
longer be equivalent to its original $F_\infty$ model, especially
when allelic interactions are involved.
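
The orthogonal partition under Hardy-Weinberg equilibrium can be checked by direct enumeration in the biallelic case. The Python sketch below assumes a one-locus allele coding model $G = \mu + \alpha w + \delta v$ with $A_2$ as reference; substituting $v = s + p\tilde{w} + p^2$ gives the coefficient $\alpha + p\delta$ on $\tilde{w}$ and $\delta$ on $s$. The parameter values and the function name are hypothetical.

```python
import itertools


def variance_partition(p, mu, alpha, delta):
    """Enumerate ordered (paternal, maternal) alleles under HWE.

    Returns Var(G), V_a, V_d and Cov(w_tilde, s) for a biallelic locus
    whose allele A1 has frequency p.
    """
    cells = []
    for z1, z2 in itertools.product((0, 1), repeat=2):
        prob = (p if z1 else 1 - p) * (p if z2 else 1 - p)
        w, v = z1 + z2, z1 * z2
        G = mu + alpha * w + delta * v
        w_tilde = w - 2 * p                 # mean-corrected allele count
        s = (z1 - p) * (z2 - p)             # mean-corrected interaction
        cells.append((prob, G, w_tilde, s))

    def mean(f):
        return sum(prob * f(G, wt, s) for prob, G, wt, s in cells)

    EG = mean(lambda G, wt, s: G)
    total = mean(lambda G, wt, s: (G - EG) ** 2)
    cov_ws = mean(lambda G, wt, s: wt * s)  # both variables have mean zero
    Va = (alpha + p * delta) ** 2 * mean(lambda G, wt, s: wt ** 2)
    Vd = delta ** 2 * mean(lambda G, wt, s: s ** 2)
    return total, Va, Vd, cov_ws


total, Va, Vd, cov_ws = variance_partition(p=0.3, mu=10.0, alpha=3.0, delta=-4.0)
print(total, Va + Vd, cov_ws)
```

In this enumeration the two mean-corrected variables are uncorrelated and the total genotypic variance equals $V_a + V_d$; the sketch covers only the biallelic case.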

Of the three coding schemes that we have discussed, the $F_\infty$
coding is perhaps the most widely used in current genetic association
studies of quantitative traits. From what we have shown, the three
coding schemes can essentially lead to equivalent models and have the
same power in detection of various genetic effects. In practice, just
like the various existing coding schemes such as 'Reference', 'GLM'
and 'Effect' that are commonly used in the analysis of categorical
covariates [22], we usually only need to adopt one specific coding
scheme in building the regression models. Which coding scheme should
be applied depends on how conveniently it provides statistical
inferences on the parameters of research interest. In general, the
allele coding models can provide direct estimates of certain
substitution effects of alleles and allelic interactions and, in the
two-locus case, allele coding models are perhaps the easiest among
the three codings in building the relationships between their model
parameters and the expected genotypic values. Besides, they are
generically linked to the genetic variance components as we have
shown above. On the other hand, the allele-count coding models are
attractive in that they often lead to simple comparisons among the
three genotypic groups with 0, 1 or 2 copies of a particular allele.
In the two-locus case, the definitions of the allele-count coding
model parameters also remain as simple as (if not simpler than) those
in the allele coding models even in the presence of epistases.
Meanwhile, both the allele and allele-count codings have the
advantage that the lower-order main effects in their models keep the
same interpretation regardless of whether epistases are involved in
the model or not. In contrast, the $F_\infty$ coding models may have
the definition of their lower-order main effects vary depending on
the absence or presence of epistases in the models. Even though the
one-locus $F_\infty$ coding model parameters are closely related to
the additive and dominance effects, the two-locus $F_\infty$ coding
model parameters, including the lower-order main effects, have more
complicated interpretations than those in the allele or allele-count
coding models, especially when epistases are involved.

The coding of marker genotypes is not limited to the three
allele-based coding schemes that we have discussed. Application of a
coding scheme could also be subject to the number of individuals
available in each genotype group. For example, under the model
framework (7), the allele coding scheme typically creates $w_j(g)$,
$v_j(g)$ for each allele type $A_j$, $j = 1, \ldots, m$. When the
group of a homozygous genotype $A_jA_j$ includes very few individuals
for a particular allele $A_j$, we may want to combine this genotypic
group with another genotype such as the one carrying one copy of the
allele $A_j$. Then we can replace the original $w_j(g)$ and $v_j(g)$
by an allele presence-absence coding variable $d_j(g)$ for this
specific allele $A_j$ while keeping two coding variables $w_k(g)$,
$v_k(g)$ for other alleles $A_k$, which leads to a mixed use of the
allele coding and this allele presence-absence coding variable. In
certain situations, the genotype-based coding could also be very
useful as it can provide direct tests on pair-wise comparisons of
certain genotypic values. Compared with the genotype-based coding,
the allele coding has the advantage of further dissecting the genetic
effects into the allelic effects and allelic interactions, which
allows us to specify reduced models with varying degrees of
interactions among the main allelic effects, a useful tool in the
model building procedures. Given a fixed coding, the likelihood ratio
test can be applied to compare a full model with its reduced models.
Statistical model selection tools such as the AIC and BIC criteria,
which provide a balance between the goodness of model fit to the data
and the complexity of the models in terms of the number of
parameters, could also be used to compare some non-nested reduced
models or frameworks. The current study focuses on establishing the
theoretical relationships between the model parameters and the
expected genotypic values according to different coding schemes under
various model frameworks. A power comparison of some reduced models
from different coding schemes under various scenarios with respect to
the allele frequencies and possible HWD or LD between marker alleles
is beyond the scope of this study and might be worthy of further
exploration.
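
As an illustration of comparing a full model with a reduced one, the Python sketch below fits an additive-only model and a model with the allelic interaction term to simulated biallelic data, then computes the Gaussian log-likelihood, AIC, BIC and the likelihood ratio statistic. The simulation settings, the parameter count (regression coefficients plus the error variance) and the function name are assumptions of this sketch, not the settings used in the study.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300
w = rng.binomial(2, 0.4, size=n)            # copies of allele A1
v = (w == 2).astype(float)                  # indicator of genotype A1A1
y = 10.0 + 3.0 * w - 4.0 * v + rng.normal(0.0, 2.0, size=n)


def fit(X, y):
    """Ordinary least squares with Gaussian log-likelihood, AIC and BIC."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    n_obs = len(y)
    rss = float(np.sum((y - X @ coef) ** 2))
    k = X.shape[1] + 1                      # coefficients plus error variance
    loglik = -0.5 * n_obs * (np.log(2 * np.pi * rss / n_obs) + 1.0)
    return {
        "rss": rss,
        "loglik": loglik,
        "aic": 2 * k - 2 * loglik,
        "bic": k * np.log(n_obs) - 2 * loglik,
    }


ones = np.ones(n)
reduced = fit(np.column_stack([ones, w]), y)      # additive effect only
full = fit(np.column_stack([ones, w, v]), y)      # adds allelic interaction

# Likelihood ratio statistic for the nested comparison (1 degree of freedom).
lrt = 2.0 * (full["loglik"] - reduced["loglik"])
print(lrt, reduced["aic"], full["aic"], reduced["bic"], full["bic"])
```

The likelihood ratio statistic applies to nested models only, whereas the AIC and BIC values can also be compared across non-nested models fitted to the same data.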



Conclusions 

In summary, we introduced three allele-based coding schemes to
construct $F_\infty$ models for association testing of multi-allele
genetic markers with quantitative traits. Depending upon whether
certain allelic effects or comparisons between genotypic groups are
of the main research interest, investigators may adopt one of the
three allele-based codings (i.e., allele, $F_\infty$ or allele-count),
or perhaps a genotype-based coding, in building an $F_\infty$ model.
Based on the $F_\infty$ model from a given coding scheme, standard
regression model fitting tools can then be applied to estimate or
test for various genetic effects. Understanding the definition of
model parameters from different coding schemes under various model
frameworks is crucial for constructing appropriate hypothesis tests
and making correct statistical inferences in genetic association
studies.

Appendices 

A. Estimability of parameters in model (8) 

Let $G = (G(g_1), \ldots, G(g_N))^T$ denote a vector of the expected
genotypic values of all the individuals in the sample, and
$\beta^* = (\mu^*, \alpha_1^*, \ldots, \alpha_m^*, \delta_1^*, \ldots, \delta_m^*)^T$
be a vector of all the model parameters. We can rewrite model (8) in
a matrix form as $G = X\beta^* + e$, where
$e = (\epsilon_1, \ldots, \epsilon_N)^T$ and the design matrix $X$ is

\[
X = [\,1_N \;\; W_1 \;\; \cdots \;\; W_m \;\; V_1 \;\; \cdots \;\; V_m\,] \tag{16}
\]

with $W_j = (w_j(g_1), \ldots, w_j(g_N))^T$ and
$V_j = (v_j(g_1), \ldots, v_j(g_N))^T$ for $j = 1, \ldots, m$. As
every individual carries two and only two alleles at the locus, we
have $\sum_{j=1}^{m} W_j = 2 \cdot 1_N$, which means that the first
$(m+1)$ column vectors $1_N, W_1, W_2, \ldots, W_m$ of the design
matrix $X$ are linearly dependent. So, $\mathrm{rank}(X) \le 2m$;
i.e., $X$ is not a full column rank matrix.

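From (7), we have $G_{jk} = \mu^* + \alpha_j^* + \alpha_k^*$ for
$j \ne k$, and $G_{jj} = \mu^* + 2\alpha_j^* + \delta_j^*$. If we
write
$G_0 = (G_{12}, G_{13}, \ldots, G_{1m}, G_{23}, \ldots, G_{m-1,m}, G_{11}, \ldots, G_{mm})^T$,
then this model gives $G_0 = X_0\beta^*$ with

\[
X_0 = \begin{pmatrix}
1_s & X_s & 0_{s \times m} \\
1_m & 2I_{m \times m} & I_{m \times m}
\end{pmatrix},
\]

where $s = m(m-1)/2$, and

\[
X_s = \begin{pmatrix}
1 & 1 & 0 & \cdots & 0 \\
1 & 0 & 1 & \cdots & 0 \\
\vdots & & & & \vdots \\
1 & 0 & 0 & \cdots & 1 \\
0 & 1 & 1 & \cdots & 0 \\
\vdots & & & & \vdots \\
0 & 0 & \cdots & 1 & 1
\end{pmatrix}.
\]

Assume that the genotypes of the sampled individuals cover all
possible genotypes $A_jA_k$ for $j, k = 1, \ldots, m$. Then the
design matrix $X$ includes all the row vectors of $X_0$, which
implies that $\mathrm{rank}(X) \ge \mathrm{rank}(X_0)$. It is clear
that $\mathrm{rank}(X_0) = m + \mathrm{rank}([1_s \; X_s])$, and it
can be shown that
$\mathrm{rank}([1_s \; X_s]) = \mathrm{rank}(X_s) = m$ when
$m \ge 3$. Therefore, $\mathrm{rank}(X) = 2m$ as $m \ge 3$. Note that
when $m = 2$, we have $s = 1$ and $\mathrm{rank}(X) = 3$.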

From linear model theory, we know that for a vector
$\lambda = (\lambda_0, \ldots, \lambda_{2m})^T \in R^{2m+1}$, a
linear function $\lambda^T\beta^*$ of $\beta^*$ is estimable if and
only if $\lambda \perp \mathcal{N}(X)$, where
$\mathcal{N}(X) = \{c \in R^{2m+1} : Xc = 0\}$ is the null space of
the design matrix $X$. It is also known that
$\mathcal{N}(X) \oplus \mathcal{R}(X) = R^{2m+1}$, where
$\mathcal{R}(X)$ is the linear space generated by the row vectors of
$X$. Hence, we have
$\dim \mathcal{N}(X) = (2m + 1) - \dim \mathcal{R}(X) = 1$. Note that
$c = (2, -1_m^T, 0_m^T)^T \in \mathcal{N}(X)$ due to the linear
dependency among the column vectors $1_N, W_1, \ldots, W_m$ in the
design matrix $X$. Therefore, for a vector
$\lambda = (\lambda_0, \lambda_1, \ldots, \lambda_{2m})^T \in R^{2m+1}$,
the linear function $\lambda^T\beta^*$ is estimable if and only if
$\lambda \perp c$, or equivalently,
$2\lambda_0 = \sum_{j=1}^{m} \lambda_j$. As a result, we know that in
model (8) the functions of model parameters
$G_{jk} = \mu^* + \alpha_j^* + \alpha_k^*$ for $j \ne k$, and
$G_{jj} = \mu^* + 2\alpha_j^* + \delta_j^*$ for $j = 1, \ldots, m$
are estimable, and the parameters
$\delta_j^* = G_{jj} - (\mu^* + 2\alpha_j^*) = G_{jj} + G_{kl} - G_{jl} - G_{jk}$
as $j \ne k$, $j \ne l$ and $k \ne l$ (or in abbreviation,
$j \ne k \ne l$) for $j = 1, \ldots, m$ are also estimable. But the
parameters $\mu^*$ and $\alpha_1^*, \ldots, \alpha_m^*$ themselves
are not estimable.
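
The rank and estimability statements above can be verified numerically for a small $m$. The Python sketch below assumes that $w_j$ counts the copies of allele $A_j$ and that $v_j$ indicates the homozygote $A_jA_j$, and it tests estimability by checking whether a coefficient vector lies in the row space of $X$; the function names are illustrative.

```python
import itertools

import numpy as np


def design_model8(m):
    """Rows [1, w_1..w_m, v_1..v_m] for all m(m+1)/2 genotypes."""
    rows = []
    for j, k in itertools.combinations_with_replacement(range(m), 2):
        w = np.zeros(m)
        w[j] += 1                      # allele counts
        w[k] += 1
        v = np.zeros(m)
        if j == k:
            v[j] = 1.0                 # homozygote indicator
        rows.append(np.concatenate(([1.0], w, v)))
    return np.array(rows)


def is_estimable(lam, X):
    """lam' beta is estimable iff lam lies in the row space of X."""
    stacked = np.vstack([X, lam])
    return np.linalg.matrix_rank(stacked) == np.linalg.matrix_rank(X)


m = 3
X = design_model8(m)
c = np.concatenate(([2.0], -np.ones(m), np.zeros(m)))   # null vector of X

G12 = np.array([1, 1, 1, 0, 0, 0, 0], dtype=float)      # mu + a1 + a2
delta1 = np.array([0, 0, 0, 0, 1, 0, 0], dtype=float)   # delta_1
mu_only = np.array([1, 0, 0, 0, 0, 0, 0], dtype=float)  # mu alone
alpha1 = np.array([0, 1, 0, 0, 0, 0, 0], dtype=float)   # alpha_1 alone

print(np.linalg.matrix_rank(X), np.allclose(X @ c, 0.0))
print([is_estimable(lam, X) for lam in (G12, delta1, mu_only, alpha1)])
```

For $m = 3$ the row-space check agrees with the condition $2\lambda_0 = \sum_j \lambda_j$ for each of the four vectors tried.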



B. Estimability of parameters in model (9) 

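For model (9), we have its design matrix

\[
W = [\,1_N \;\; W_1 \;\; W_2 \;\; \cdots \;\; W_{m-1} \;\; V_1 \;\; V_2 \;\; \cdots \;\; V_m\,]
\]

where $W_j = (w_j(g_1), \ldots, w_j(g_N))^T$ and
$V_j = (v_j(g_1), \ldots, v_j(g_N))^T$ for $j = 1, \ldots, m$. It can
be shown that $W$ and the design matrix $X$ defined in (16) for model
(8) have the following relationship: $W = XT$ or $X = WS^T$, where

\[
T = \begin{pmatrix}
I_{m \times m} & 0_{m \times m} \\
0_{1 \times m} & 0_{1 \times m} \\
0_{m \times m} & I_{m \times m}
\end{pmatrix}
\quad \text{and} \quad
S = \begin{pmatrix}
I_{m \times m} & 0_{m \times m} \\
d^T & 0_{1 \times m} \\
0_{m \times m} & I_{m \times m}
\end{pmatrix}
\]

with $d = (2, -1, \ldots, -1)^T \in R^m$. Let
$\beta = (\mu, \alpha_1, \ldots, \alpha_{m-1}, \delta_1, \ldots, \delta_m)^T$.
Therefore, as (8) and (9) are two equivalent models, we have
$G = X\beta^* = WS^T\beta^* = W\beta$, which yields

\[
\beta = \begin{pmatrix}
\mu \\ \alpha_1 \\ \vdots \\ \alpha_{m-1} \\ \delta_1 \\ \vdots \\ \delta_m
\end{pmatrix}
= S^T\beta^* =
\begin{pmatrix}
\mu^* + 2\alpha_m^* \\ \alpha_1^* - \alpha_m^* \\ \vdots \\ \alpha_{m-1}^* - \alpha_m^* \\ \delta_1^* \\ \vdots \\ \delta_m^*
\end{pmatrix}.
\]

From this relationship, we have $\delta_j = \delta_j^*$,
$j = 1, \ldots, m$, which are estimable as shown in Appendix A.
Besides, the intercept
$\mu = \mu^* + 2\alpha_m^* = G_{jm} + G_{km} - G_{jk}$ and
$\alpha_j = \alpha_j^* - \alpha_m^* = G_{jk} - G_{km}$, $k \ne j$,
$j = 1, \ldots, m - 1$, are also estimable.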

C. Relationships for fully parameterized two-locus models 
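(C.1) Relationships between parameters of the fully parameterized
two-locus model (13) and the expected genotypic values are

\begin{align*}
\mu &= G_{m_1m_1m_2m_2}, \\
\alpha_{1j} &= G_{jm_1m_2m_2} - \mu = G_{jm_1m_2m_2} - G_{m_1m_1m_2m_2}, \\
\alpha_{2r} &= G_{m_1m_1rm_2} - \mu = G_{m_1m_1rm_2} - G_{m_1m_1m_2m_2}, \\
\delta_{1jk} &= G_{jkm_2m_2} - \alpha_{1j} - \alpha_{1k} - \mu \\
 &= G_{jkm_2m_2} - (G_{jm_1m_2m_2} + G_{km_1m_2m_2}) + G_{m_1m_1m_2m_2}, \\
\delta_{2rs} &= G_{m_1m_1rs} - \alpha_{2r} - \alpha_{2s} - \mu \\
 &= G_{m_1m_1rs} - (G_{m_1m_1rm_2} + G_{m_1m_1sm_2}) + G_{m_1m_1m_2m_2}, \\
(\alpha_{1j}\alpha_{2r}) &= G_{jm_1rm_2} - \alpha_{1j} - \alpha_{2r} - \mu \\
 &= G_{jm_1rm_2} - (G_{jm_1m_2m_2} + G_{m_1m_1rm_2}) + G_{m_1m_1m_2m_2}, \\
(\delta_{1jk}\alpha_{2r}) &= G_{jkrm_2} - \alpha_{1j} - \alpha_{1k} - \alpha_{2r} - \delta_{1jk}
 - (\alpha_{1j}\alpha_{2r}) - (\alpha_{1k}\alpha_{2r}) - \mu \\
 &= G_{jkrm_2} - (G_{jkm_2m_2} + G_{jm_1rm_2} + G_{km_1rm_2}) \\
 &\quad + (G_{jm_1m_2m_2} + G_{km_1m_2m_2} + G_{m_1m_1rm_2}) - G_{m_1m_1m_2m_2}, \\
(\alpha_{1j}\delta_{2rs}) &= G_{jm_1rs} - \alpha_{2r} - \alpha_{2s} - \delta_{2rs} - \alpha_{1j}
 - (\alpha_{1j}\alpha_{2r}) - (\alpha_{1j}\alpha_{2s}) - \mu \\
 &= G_{jm_1rs} - (G_{m_1m_1rs} + G_{jm_1rm_2} + G_{jm_1sm_2}) \\
 &\quad + (G_{jm_1m_2m_2} + G_{m_1m_1rm_2} + G_{m_1m_1sm_2}) - G_{m_1m_1m_2m_2}, \\
(\delta_{1jk}\delta_{2rs}) &= G_{jkrs} - \alpha_{1j} - \alpha_{1k} - \delta_{1jk} - \alpha_{2r} - \alpha_{2s} - \delta_{2rs} \\
 &\quad - (\alpha_{1j}\alpha_{2r}) - (\alpha_{1j}\alpha_{2s}) - (\alpha_{1k}\alpha_{2r}) - (\alpha_{1k}\alpha_{2s}) \\
 &\quad - (\alpha_{1j}\delta_{2rs}) - (\alpha_{1k}\delta_{2rs}) - (\delta_{1jk}\alpha_{2r}) - (\delta_{1jk}\alpha_{2s}) - \mu \\
 &= G_{jkrs} - (G_{jm_1rs} + G_{km_1rs} + G_{jkrm_2} + G_{jksm_2}) \\
 &\quad + (G_{jkm_2m_2} + G_{jm_1rm_2} + G_{km_1rm_2} + G_{jm_1sm_2} + G_{km_1sm_2} + G_{m_1m_1rs}) \\
 &\quad - (G_{jm_1m_2m_2} + G_{km_1m_2m_2} + G_{m_1m_1rm_2} + G_{m_1m_1sm_2}) + G_{m_1m_1m_2m_2}
\end{align*}

for $j, k = 1, \ldots, m_1 - 1$; $r, s = 1, \ldots, m_2 - 1$ and
$j \le k$, $r \le s$.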
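
These relationships can be checked by building the 18 genotypic values from a set of parameters and then recovering each parameter as a product of one-locus contrasts. The Python sketch below uses the true parameter values from the simulation table with three alleles at locus 1 and two at locus 2, and assumes that the last allele at each locus is the reference; the dictionary layout and the function names are hypothetical.

```python
MU = 42.0
A1 = {1: -6.0, 2: 4.0}                               # alpha_1j
A2 = {1: 0.6}                                        # alpha_2r
D1 = {(1, 1): -20.0, (2, 2): 0.0, (1, 2): -10.0}     # delta_1jk
D2 = {(1, 1): -2.0}                                  # delta_2rs
AA = {(1, 1): 0.2, (2, 1): 0.1}                      # (alpha_1j alpha_2r)
DA = {((1, 1), 1): -0.1, ((2, 2), 1): -0.5, ((1, 2), 1): -0.4}
AD = {(1, (1, 1)): -0.2, (2, (1, 1)): -0.2}
DD = {((1, 1), (1, 1)): 0.2, ((2, 2), (1, 1)): 1.7, ((1, 2), (1, 1)): 1.0}


def G(j, k, r, s):
    """Genotypic value of A_j A_k B_r B_s; reference-allele terms are zero."""
    jk, rs = tuple(sorted((j, k))), tuple(sorted((r, s)))
    value = MU + D1.get(jk, 0.0) + D2.get(rs, 0.0) + DD.get((jk, rs), 0.0)
    for a in (j, k):
        value += A1.get(a, 0.0) + AD.get((a, rs), 0.0)
        for b in (r, s):
            value += AA.get((a, b), 0.0)
    for b in (r, s):
        value += A2.get(b, 0.0) + DA.get((jk, b), 0.0)
    return value


# One-locus contrasts as lists of (coefficient, genotype) with reference m.
def mu_c(m):
    return [(1.0, (m, m))]


def alpha_c(j, m):
    return [(1.0, (j, m)), (-1.0, (m, m))]


def delta_c(j, k, m):
    return [(1.0, (j, k)), (-1.0, (j, m)), (-1.0, (k, m)), (1.0, (m, m))]


def contrast(c1, c2):
    """Two-locus parameter as the product of two one-locus contrasts."""
    return sum(a * b * G(*g1, *g2) for a, g1 in c1 for b, g2 in c2)


print(contrast(delta_c(1, 2, 3), mu_c(2)))            # delta_112
print(contrast(alpha_c(1, 3), alpha_c(1, 2)))         # (alpha_11 alpha_21)
print(contrast(delta_c(1, 1, 3), alpha_c(1, 2)))      # (delta_111 alpha_21)
print(contrast(delta_c(1, 2, 3), delta_c(1, 1, 2)))   # (delta_112 delta_211)
```

Expanding the product of a locus-1 contrast and a locus-2 contrast term by term gives the sums of genotypic values listed in (C.1).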

(C.2) Relationships between parameters of the fully 
parameterized two-locus model (14) and the expected 
genotypic values are 
\begin{align*}
 &\;\;\vdots \\
(d_{1jj}d_{2rs}) &= \ldots \\
(d_{1jk}d_{2rr}) &= \ldots \\
(d_{1jj}d_{2rr}) &= \ldots \\
(d_{1jk}d_{2rs}) &= (\delta_{1jk}\delta_{2rs}), \quad j < k,\ r < s
\end{align*}



for $j, k = 1, \ldots, m_1 - 1$ and $r, s = 1, \ldots, m_2 - 1$,
where the relationships between the parameters of model (14) and
model (13) are built based on the equivalence between the two models.
The relationships between the parameters of model (14) and the
expected genotypic values can then be derived by replacing the
parameters of model (13) with the expected genotypic values from the
previously established results in (C.1).

(C.3) Relationships between parameters of the fully 
parameterized two-locus model (15) and the expected 
genotypic values are 
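\begin{align*}
 &\;\;\vdots \\
(\eta_{1jj}\eta_{2rs}) &= 2(\alpha_{1j}\delta_{2rs}) + (\delta_{1jj}\delta_{2rs}) \\
 &= G_{jjrs} - (G_{m_1m_1rs} + G_{jjrm_2} + G_{jjsm_2}) \\
 &\quad + (G_{m_1m_1rm_2} + G_{m_1m_1sm_2} + G_{jjm_2m_2}) - G_{m_1m_1m_2m_2}, \quad r < s, \\
(\eta_{1jk}\eta_{2rr}) &= 2(\delta_{1jk}\alpha_{2r}) + (\delta_{1jk}\delta_{2rr}) \\
 &= G_{jkrr} - (G_{jkm_2m_2} + G_{jm_1rr} + G_{km_1rr}) \\
 &\quad + (G_{jm_1m_2m_2} + G_{km_1m_2m_2} + G_{m_1m_1rr}) - G_{m_1m_1m_2m_2}, \quad j < k, \\
(\eta_{1jk}\eta_{2rs}) &= (\delta_{1jk}\delta_{2rs}), \quad j < k,\ r < s
\end{align*}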



for $j, k = 1, \ldots, m_1 - 1$ and $r, s = 1, \ldots, m_2 - 1$,
where the relationships between the parameters of model (15) and
model (13) are built based on the equivalence between the two models.
The relationships between the parameters of model (15) and the
expected genotypic values are then derived by replacing the
parameters of model (13) with the expected genotypic values from the
previously established results in (C.1).